web scraper

contents

  • logging
  • data base access
  • solr indexing
  • filesystem access
  • web scraping

logging

Data base access

– mysql in python


import mysql.connector
# from mysql.connector import Error

# pip3 install mysql-connector
# https://dev.mysql.com/doc/connector-python/en/connector-python-reference.html

class DB():
    def __init__(self, config):
        self.connection = None
        self.connection = mysql.connector.connect(**config)
        
    def query(self, sql, args):
        cursor = self.connection.cursor()
        cursor.execute(sql, args)
        return cursor

    def insert(self,sql,args):
        cursor = self.query(sql, args)
        id = cursor.lastrowid
        self.connection.commit()
        cursor.close()
        return id

    # https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-executemany.html
    def insertmany(self,sql,args):
        cursor = self.connection.cursor()
        cursor.executemany(sql, args)
        rowcount = cursor.rowcount
        self.connection.commit()
        cursor.close()
        return rowcount

    def update(self,sql,args):
        cursor = self.query(sql, args)
        rowcount = cursor.rowcount
        self.connection.commit()
        cursor.close()
        return rowcount

    def fetch(self, sql, args):
        rows = []
        cursor = self.query(sql, args)
        if cursor.with_rows:
            rows = cursor.fetchall()
        cursor.close()
        return rows

    def fetchone(self, sql, args):
        row = None
        cursor = self.query(sql, args)
        if cursor.with_rows:
            row = cursor.fetchone()
        cursor.close()
        return row

    def __del__(self):
        if self.connection != None:
            self.connection.close()

  # write your function here for CRUD operations

solr indexing

filesystem access

web scraping

solr – managed-schema field definitions

name type description active flags deactive flags
ignored_* string catchall for all undefined metadata multiValued
id string unique id field stored, required multiValued
_version_ plong internal solr field indexed, stored
text text_general content field for facetting multiValued docValues, stored
content text_general main content field as extracted by tika stored, multiValued, indexed docValues
author string author retrieved from tika multiValued, indexed, docValues stored
*author string dynamic field for authors retrieved from tika multiValued, indexed, docValues stored
title string title retrieved from tika multiValued, indexed, docValues stored
*title string dynamic title field retrieved from tika multiValued, indexed, docValues stored
date string date retrieved from tika multiValued, indexed, docValues stored
content_type plongs content_type retrieved from tika multiValued, indexed, docValues stored
stream_size string stream_size retrieved from tika multiValued, indexed, docValues stored
cat string category defined by user through manifoldcf multiValued, docValues stored

Additional copyField statements to insert data in fields:

  • source=”content” dest=”text”
  • source=”*author” dest=”author”
  • source=”*title” dest=”title”