web scraper

contents

  • logging
  • database access
  • solr indexing
  • filesystem access
  • web scraping

logging

database access

– mysql in python


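import mysql.connector
# from mysql.connector import Error

# pip3 install mysql-connector-python
# https://dev.mysql.com/doc/connector-python/en/connector-python-reference.html

class DB:
    """Thin convenience wrapper around a single mysql.connector connection."""

    def __init__(self, config):
        # Set first so that __del__ still works if connect() raises.
        self.connection = None
        self.connection = mysql.connector.connect(**config)

    def query(self, sql, args=None):
        """Execute sql and return the open cursor; the caller must close it."""
        cursor = self.connection.cursor()
        cursor.execute(sql, args)
        return cursor

    def insert(self, sql, args=None):
        """Run a single INSERT and return the generated row id."""
        cursor = self.query(sql, args)
        row_id = cursor.lastrowid
        self.connection.commit()
        cursor.close()
        return row_id

    # https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-executemany.html
    def insertmany(self, sql, args):
        """Run the INSERT once per parameter tuple in args; return the row count."""
        cursor = self.connection.cursor()
        cursor.executemany(sql, args)
        rowcount = cursor.rowcount
        self.connection.commit()
        cursor.close()
        return rowcount

    def update(self, sql, args=None):
        """Run an UPDATE or DELETE and return the number of affected rows."""
        cursor = self.query(sql, args)
        rowcount = cursor.rowcount
        self.connection.commit()
        cursor.close()
        return rowcount

    def fetch(self, sql, args=None):
        """Return all result rows as a list (empty if there is no result set)."""
        rows = []
        cursor = self.query(sql, args)
        if cursor.with_rows:
            rows = cursor.fetchall()
        cursor.close()
        return rows

    def fetchone(self, sql, args=None):
        """Return the first result row, or None if there is none."""
        row = None
        # Buffered cursor: rows left unread after fetchone() would
        # otherwise raise "Unread result found" when the cursor is closed.
        cursor = self.connection.cursor(buffered=True)
        cursor.execute(sql, args)
        if cursor.with_rows:
            row = cursor.fetchone()
        cursor.close()
        return row

    def __del__(self):
        if self.connection is not None:
            self.connection.close()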

  # write your function here for CRUD operations
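
One way to fill in the placeholder above, as a minimal sketch: a helper that builds a parameterized INSERT statement which could then be handed to DB.insert or DB.insertmany. The function name build_insert and the "pages" table with its columns are made up for illustration; only the %s placeholder style comes from mysql-connector.

```python
def build_insert(table, columns):
    """Return an INSERT statement with one %s placeholder per column.

    Hypothetical helper, not part of mysql-connector. Table and column
    names are interpolated directly, so they must come from trusted code,
    never from scraped or user-supplied input; only the values travel
    through the placeholders.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(f"`{name}`" for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"


# Example with a hypothetical "pages" table:
sql = build_insert("pages", ["url", "title"])
# db.insert(sql, ("https://example.org", "Example"))  # needs a live database
```

Keeping the values out of the SQL string lets the connector handle quoting and escaping, which matters for scraped content.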

solr indexing

filesystem access

web scraping

solr – managed-schema field definitions

name         | type         | description                                   | active flags                    | inactive flags
ignored_*    | string       | catchall for all undefined metadata           | multiValued                     | -
id           | string       | unique id field                               | stored, required                | multiValued
_version_    | plong        | internal solr field                           | indexed, stored                 | -
text         | text_general | content field for faceting                    | multiValued                     | docValues, stored
content      | text_general | main content field as extracted by tika       | stored, multiValued, indexed    | docValues
author       | string       | author retrieved from tika                    | multiValued, indexed, docValues | stored
*author      | string       | dynamic field for authors retrieved from tika | multiValued, indexed, docValues | stored
title        | string       | title retrieved from tika                     | multiValued, indexed, docValues | stored
*title       | string       | dynamic title field retrieved from tika       | multiValued, indexed, docValues | stored
date         | string       | date retrieved from tika                      | multiValued, indexed, docValues | stored
content_type | plongs       | content_type retrieved from tika              | multiValued, indexed, docValues | stored
stream_size  | string       | stream_size retrieved from tika               | multiValued, indexed, docValues | stored
cat          | string       | category defined by user through manifoldcf   | multiValued, docValues          | stored
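
The two flag columns translate into true/false attributes on a field element in managed-schema. The following is only a sketch of that mapping; the helper name field_element and its parameters are my own and not part of solr.

```python
import xml.etree.ElementTree as ET


def field_element(name, type_, active=(), inactive=(), dynamic=False):
    """Render one table row as a <field> or <dynamicField> element.

    Hypothetical helper: flags listed as active become "true",
    flags listed as inactive become "false".
    """
    tag = "dynamicField" if dynamic else "field"
    attrs = {"name": name, "type": type_}
    for flag in active:
        attrs[flag] = "true"
    for flag in inactive:
        attrs[flag] = "false"
    return ET.tostring(ET.Element(tag, attrs), encoding="unicode")


# The "content" row from the table above:
print(field_element(
    "content", "text_general",
    active=("stored", "multiValued", "indexed"),
    inactive=("docValues",),
))
```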

Additional copyField statements to insert data in fields:

  • source="content" dest="text"
  • source="*author" dest="author"
  • source="*title" dest="title"
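
Instead of editing managed-schema by hand, the same copyField rules can be expressed as a request body for the solr Schema API. The sketch below only builds the JSON; the function name copy_field_commands is an assumption of mine, and the target URL in the comment uses the default port with a placeholder collection name.

```python
import json


def copy_field_commands(pairs):
    """Build a Schema API body with one add-copy-field command per pair."""
    return {
        "add-copy-field": [
            {"source": source, "dest": [dest]} for source, dest in pairs
        ]
    }


payload = copy_field_commands([
    ("content", "text"),
    ("*author", "author"),
    ("*title", "title"),
])
body = json.dumps(payload, indent=2)
print(body)
# The body would be POSTed with "Content-Type: application/json" to
# http://localhost:8983/solr/<collection>/schema
```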

solr search server with tika and manifoldcf

I finally got my search server running, with solr as the main engine and tika for extraction. The setup is completed by manifoldcf, which provides access to files, emails, wikis, RSS feeds and the web.

solr

A short overview of the basic file structure of solr is shown below:

filestructure


<solr-home-directory>/
    solr.xml
    core_name1/
        core.properties
        conf/
            solrconfig.xml
            managed-schema
        data/

And here is my core.properties file. It is very basic: a single server, no cloud mode.

core.properties


name=collection name
config=solrconfig.xml
dataDir=collection name/data

schema fields from tika

The following fields are essential for my setup:

  • id – the identifier unique for solr
  • _version_ – the internal version field that solr maintains for document updates
  • content – the text representation of the extraction results from tika
  • ignored_* – as a catchall for any metadata that is not covered by a field in the index
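
These fields come together when a document is sent to the extracting request handler (solr cell). The sketch below only assembles the request URL; it assumes that /update/extract is enabled in solrconfig.xml, and the helper name extract_url, the core name and the document id are placeholders.

```python
from urllib.parse import urlencode


def extract_url(base_url, collection, doc_id):
    """Build an /update/extract URL for a single document."""
    params = {
        "literal.id": doc_id,    # value for the unique id field
        "uprefix": "ignored_",   # unknown metadata goes to ignored_*
        "commit": "true",
    }
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"


# "mycore" and "doc1" are placeholders:
print(extract_url("http://localhost:8983", "mycore", "doc1"))
```

The file itself would be sent as the body of a POST request to that URL.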

The solr installation follows the instructions given by the project team. As I am using debian, the solr.in.sh is pretty much standard. Here are the settings:


SOLR_PID_DIR="/var/solr"
SOLR_HOME="/var/solr/data"
LOG4J_PROPS="/var/solr/log4j2.xml"
SOLR_LOGS_DIR="/var/solr/logs"
SOLR_PORT="8983"

Solr is started via the old init.d-style script from the project team. No modifications here.

The specific managed-schema and solrconfig.xml files are not listed here, but getting them to work took the most time. Some comments:

  • grab some information on the metadata extracted by tika to find the fields that should be worth a second look
  • check the configuration given in /var/solr/data/conf/
  • check especially the solr log at /var/solr/logs/solr.log
  • managed-schema should be adjusted for the metadata retrieved through tika
  • delete any old collection files by removing /var/solr/data/collection name/collection name/index/
  • solr cell is responsible for importing/indexing files in foreign formats like PDF, Word, etc
  • set stored false as often as possible
  • set indexed false as much as possible
  • remove copyfields as far as possible
  • set indexed false for text_general
  • use catchall field for indexing
  • start JVM in server mode
  • set logging on higher level only
  • integrate everything in tomcat
  • set indexed or docValues to true but not both
  • some field type annotations: Solr Manual 8.11
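
Since the solr log is the first place to look, a small filter saves some scrolling. This is a rough sketch that assumes the log level appears as a separate word in each line, as in the default log4j2 layout; the function name problem_lines is made up.

```python
def problem_lines(lines, levels=("ERROR", "WARN")):
    """Yield the log lines that contain one of the given level names."""
    for line in lines:
        if any(f" {level} " in line for level in levels):
            yield line.rstrip("\n")


def scan(path="/var/solr/logs/solr.log"):
    """Print all ERROR and WARN lines of the given log file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in problem_lines(log):
            print(line)
```

Calling scan() with no arguments reads the log location used in this setup.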

some interesting commands

  • bin/solr start
  • bin/solr stop -all
  • bin/post -c collection input
  • bin/solr delete -c collection
  • bin/solr create -c collection -d configdir

All paths are relative to the solr installation directory.

velocity setup

velocity may be used as a search interface for solr, but my setup is not complete yet.

tika

The tika server is also installed as described by the project team. I only added a systemd unit file to start it:


    [Unit]
    Description=Apache Tika Server
    After=network.target

    [Service]
    Type=simple
    User=tika
    Environment="TIKA_INCLUDE=/etc/default/tika.in.sh"
    ExecStart=/usr/bin/java -jar /opt/tika/tika-server-standard-2.3.0.jar --port 9998 --config /opt/tika/tika-config.xml
    Restart=always

    [Install]
    WantedBy=multi-user.target

The tika.in.sh is once again copied from the project team's suggestion without modifications:


    TIKA_PID_DIR="/var/tika"
    LOG4J_PROPS="/var/tika/log4j.properties"
    TIKA_LOGS_DIR="/var/tika/logs"
    TIKA_PORT="9998"
    TIKA_FORKED_OPTS=""

The tika-config.xml is quite empty at the moment, but I hope to get logging running soon.
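
To see which metadata tika extracts from a file, the server can be queried directly on the port configured above. The sketch below prepares a PUT request for the /meta endpoint (or /tika with Accept set to text/plain for the extracted text); the helper names tika_request and fetch and the file name are mine, and nothing is sent until fetch is called.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # port from the unit file above


def tika_request(data, endpoint="meta", accept="application/json"):
    """Prepare a PUT request that sends raw file bytes to tika."""
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}",
        data=data,
        method="PUT",
        headers={"Accept": accept},
    )


def fetch(request):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage, assuming a running server and a local file "sample.pdf":
# with open("sample.pdf", "rb") as handle:
#     print(fetch(tika_request(handle.read())))
```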

ManifoldCF

Finally, manifoldcf is installed from scratch as the interface to the various information sources.

Here is my systemd unit file:

    [Unit]
    Description=ManifoldCF service

    [Service]
    WorkingDirectory=/opt/manifoldcf/example
    ExecStart=/usr/bin/java -Xms512m -Xmx512m -Dorg.apache.manifoldcf.configfile=./properties.xml -Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token -Djava.security.auth.login.config= -cp .:../lib/mcf-core.jar:../lib/mcf-agents.jar:../lib/mcf-pull-agent.jar:../lib/mcf-ui-core.jar:../lib/mcf-jetty-runner.jar:../lib/jetty-client-9.4.25.v20191220.jar:../lib/jetty-continuation-9.4.25.v20191220.jar:../lib/jetty-http-9.4.25.v20191220.jar:../lib/jetty-io-9.4.25.v20191220.jar:../lib/jetty-jndi-9.4.25.v20191220.jar:../lib/jetty-jsp-9.2.30.v20200428.jar:../lib/jetty-jsp-jdt-2.3.3.jar:../lib/jetty-plus-9.4.25.v20191220.jar:../lib/jetty-schemas-3.1.M0.jar:../lib/jetty-security-9.4.25.v20191220.jar:../lib/jetty-server-9.4.25.v20191220.jar:../lib/jetty-servlet-9.4.25.v20191220.jar:../lib/jetty-util-9.4.25.v20191220.jar:../lib/jetty-webapp-9.4.25.v20191220.jar:../lib/jetty-xml-9.4.25.v20191220.jar:../lib/commons-codec-1.10.jar:../lib/commons-collections-3.2.2.jar:../lib/commons-collections4-4.2.jar:../lib/commons-discovery-0.5.jar:../lib/commons-el-1.0.jar:../lib/commons-exec-1.3.jar:../lib/commons-fileupload-1.3.3.jar:../lib/commons-io-2.5.jar:../lib/commons-lang-2.6.jar:../lib/commons-lang3-3.9.jar:../lib/commons-logging-1.2.jar:../lib/ecj-4.3.1.jar:../lib/gson-2.8.0.jar:../lib/guava-25.1-jre.jar:../lib/httpclient-4.5.8.jar:../lib/httpcore-4.4.10.jar:../lib/jasper-6.0.35.jar:../lib/jasper-el-6.0.35.jar:../lib/javax.servlet-api-3.1.0.jar:../lib/jna-5.3.1.jar:../lib/jna-platform-5.3.1.jar:../lib/json-simple-1.1.1.jar:../lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:../lib/juli-6.0.35.jar:../lib/log4j-1.2-api-2.4.1.jar:../lib/log4j-api-2.4.1.jar:../lib/log4j-core-2.4.1.jar:../lib/mail-1.4.5.jar:../lib/serializer-2.7.1.jar:../lib/slf4j-api-1.7.25.jar:../lib/slf4j-simple-1.7.25.jar:../lib/velocity-1.7.jar:../lib/xalan-2.7.1.jar:../lib/xercesImpl-2.10.0.jar:../lib/xml-apis-1.4.01.jar:../lib/zookeeper-3.4.10.jar:../lib/javax.activation-1.2.0.jar:../lib/javax.activation-api-1.2.0.jar: -jar start.jar
    User=solr
    Type=simple
    SuccessExitStatus=143
    TimeoutStopSec=10
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target