CISRI, Cogniva & uOttawa ESIS Discussion at ARMA NCR 2013 FALL IM DAYS

Join us November 19th & 20th as ARMA NCR hosts the 2013 Fall IM Days Conference: the annual hub for the Information Management community in the National Capital.

Cogniva, CISRI & uOttawa ESIS will be joining forces to present a session entitled: Business Process Analysis & Automatic Classification of Email of Business Value: November 20th, 9:30 am at the Brookstreet Hotel in Kanata, Ottawa.

Join Yves Marleau, Craig Eby, Inges Alberts & André Vellino as they discuss the challenges associated with the automated identification of email of business value in a governmental context. The management of email is an essential stepping-stone on the path toward good governance and long-term business performance for the Canadian Government – come and see how your organization can benefit!

Read more: IM Fall Days Agenda


Automated Taxonomy Discovery

Here at CISRI we are excited to be engaged in a research project on automated concept and taxonomy discovery. This research is related to an IRAP grant which Cogniva Information Solutions received. The goal of the project is to simplify the creation of metadata taxonomies and the relations between the concepts using text analytics approaches. We are using a variety of open source tools and research methodologies. We will be posting updates and technical details as the research progresses, so stay tuned!

How to Set up Solr and ManifoldCF on an Ubuntu Based Computer

This blog post provides some guidance on how to set up a computer to run Apache Solr (http://lucene.apache.org/solr/) and Apache ManifoldCF (http://manifoldcf.apache.org/). Solr is a wrapper for Lucene. It provides a web UI and a variety of features such as document text extraction (via Apache Tika). ManifoldCF is a utility for scheduling jobs and providing repository connectors. We have used it to import documents from both a Windows (CIFS) file share and MS SharePoint 2010 into Solr.

This guide was written while installing and configuring Solr and ManifoldCF on a VirtualBox virtual machine running Linux Mint 15 (MATE) x64 (http://www.linuxmint.com/). I chose Linux Mint because it is a “hot” GNU/Linux distribution these days (http://distrowatch.com/dwres.php?resource=major). These instructions can also be used to install and configure Solr and ManifoldCF on Ubuntu. Just be aware that the standard text editor on MATE is pluma and on GNOME it is gedit, so anywhere you see ‘pluma’ below, substitute ‘gedit’ on Ubuntu. These instructions should also work on Debian, but I have not verified this (again, substitute ‘gedit’ for ‘pluma’).

The development of this guide was a joint effort of Chris Salter and myself.

Install Solr

  • Download Solr 4.3.1
  • Decompress and move the solr-4.3.1 directory to /usr/share/solr
  • To do this via the terminal
    • cd ~/Downloads
    • wget http://archive.apache.org/dist/lucene/solr/4.3.1/solr-4.3.1.tgz
    • tar -xzvf solr-4.3.1.tgz
    • sudo cp -R solr-4.3.1 /usr/share/solr
  • Test Solr
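    • Open a terminal
      • cd /usr/share/solr/example
      • sudo java -jar start.jar
    • Open another terminal
      • cd /usr/share/solr/example/exampledocs
      • java -jar post.jar *.xml
    • Confirm that you see something like Figure 1.
    • Open a browser to http://localhost:8983/solr (the default address of the Solr example) and confirm that you see something like Figure 2.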
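
One way to confirm from the terminal that the example documents were indexed is to ask Solr how many documents it holds. The sketch below assumes the stock Solr 4.3.1 example settings (port 8983 and a core named collection1) and that curl is installed. The helper name num_found is hypothetical and not part of Solr.

```shell
#!/bin/sh
# Hypothetical helper: read a Solr JSON response on stdin and print numFound.
num_found() {
    grep -o '"numFound":[0-9]*' | head -n 1 | cut -d: -f2
}

# Assumed default URL of the Solr 4.3.1 example core; adjust if yours differs.
SOLR_SELECT='http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json'

# With Solr running, print the number of indexed documents:
#   curl -s "$SOLR_SELECT" | num_found
```

A count greater than zero after running post.jar suggests the documents reached the index. The same check may be useful later, once a crawl job has run.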

Figure 1: Solr test

 

Figure 2: Solr WebUI

Install ManifoldCF

  • Download ManifoldCF 1.3
  • Decompress and move the apache-manifoldcf-1.3 directory to /usr/share/manifoldcf
  • To do this via the terminal
    • cd ~/Downloads
    • wget http://apache.mirror.rafal.ca/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz
    • tar -xzvf apache-manifoldcf-1.3-bin.tar.gz
    • sudo cp -R apache-manifoldcf-1.3 /usr/share/manifoldcf
  • Test
    • Open a terminal
      • cd  /usr/share/manifoldcf/example
      • sudo java -jar start.jar
    • Open a browser to http://localhost:8345/mcf-crawler-ui/ (the default address of the ManifoldCF example)
    • Confirm that you see something like Figure 3.
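
The start.jar commands run in the foreground, and the web UI can take a little while to come up. The following is a minimal sketch of a wait loop. It assumes curl is installed and that the ManifoldCF example serves its crawler UI at the default http://localhost:8345/mcf-crawler-ui/. The helper name wait_for_http is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers, or give up.
# Usage: wait_for_http URL [TRIES]
wait_for_http() {
    url="$1"
    tries="${2:-30}"
    i=1
    while [ "$i" -le "$tries" ]; do
        # -s hides progress output, -f treats HTTP errors (>= 400) as failures.
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$tries" ]; then
            sleep 2
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (assumed default ManifoldCF UI address):
#   wait_for_http http://localhost:8345/mcf-crawler-ui/ && echo "ManifoldCF is up"
```

The same function could be pointed at the Solr address to confirm that Solr is running before connecting ManifoldCF to it.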

 

Figure 3: ManifoldCF WebUI

Configure Solr

  • Recommended: Ignore Tika Errors
    • Edit /usr/share/solr/example/solr/collection1/conf/solrconfig.xml
    • Add the line <bool name="ignoreTikaException">true</bool> to the list <lst name="defaults"> under <requestHandler name="/update/extract" … >
    • After the changes the relevant part of the file should look like figure 4.  The text highlighting was added for clarity.
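
For those who prefer to script this edit, the sketch below inserts the line with sed. It assumes GNU sed (the default on Ubuntu and Linux Mint) and that <lst name="defaults"> appears on its own line inside the /update/extract handler, as in the stock solrconfig.xml. The helper name add_ignore_tika is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: add ignoreTikaException to the /update/extract defaults.
# Usage: add_ignore_tika /path/to/solrconfig.xml
add_ignore_tika() {
    conf="$1"
    # Do nothing if the setting is already present.
    if grep -q 'name="ignoreTikaException"' "$conf"; then
        return 0
    fi
    cp "$conf" "$conf.bak"   # keep a backup before editing in place
    # Only touch the <lst name="defaults"> that lies between the opening
    # and closing tags of the /update/extract request handler.
    sed -i '/<requestHandler name="\/update\/extract"/,/<\/requestHandler>/ s|<lst name="defaults">|&\n      <bool name="ignoreTikaException">true</bool>|' "$conf"
}

# Example (run as root, since the file lives under /usr/share):
#   add_ignore_tika /usr/share/solr/example/solr/collection1/conf/solrconfig.xml
```

Restart Solr afterwards so the changed configuration is loaded.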

 

Configure ManifoldCF

Connect ManifoldCF to Solr

  • Start Solr if it is not already running
  • Open the ManifoldCF WebUI in a browser
  • Click “List Output Connections”
  • Click “Add a new output connection”
  • Name = Solr
  • Description = Connect to Solr
  • Click “Type” tab
  • Connection type = Solr
  • Click “Continue” button
  • Recommended: Click “Documents” tab
    • Maximum document length = 10240000 (i.e., 10MB)
  • Click Save
  • Confirm that Connection Status = Connection working
    • (see figure 4).

 

Figure 4: Solr Output Connection

Add Windows File Share Support

  • Stop ManifoldCF
  • Download http://jcifs.samba.org/src/jcifs-1.3.17.jar
  • Move the file jcifs-1.3.17.jar to /usr/share/manifoldcf/connector-lib-proprietary
  • Edit /usr/share/manifoldcf/connectors.xml
  • Uncomment <repositoryconnector name="Windows shares"/>
    • See Figure 5.
  • Save the file
  • Start ManifoldCF
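
The uncommenting step can also be scripted. The sketch below is one possible approach, not part of ManifoldCF. It assumes GNU sed and that the commented-out entry sits on a single line in the form <!--repositoryconnector name="Windows shares" …/-->. If your connectors.xml is laid out differently, edit it by hand instead. The helper name enable_windows_shares is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: uncomment the "Windows shares" connector entry.
# Usage: enable_windows_shares /path/to/connectors.xml
enable_windows_shares() {
    conf="$1"
    cp "$conf" "$conf.bak"   # keep a backup before editing in place
    # Turn <!--repositoryconnector name="Windows shares" .../--> into
    # <repositoryconnector name="Windows shares" .../>.
    sed -i 's|<!--\(repositoryconnector name="Windows shares"[^>]*/\)-->|<\1>|' "$conf"
    # Succeed only if an uncommented entry is now present.
    grep -q '<repositoryconnector name="Windows shares"' "$conf"
}

# Example (run as root, with ManifoldCF stopped):
#   enable_windows_shares /usr/share/manifoldcf/connectors.xml
```

The function returns a non-zero status when no uncommented entry is found after the edit, which is a hint to open the file and check it manually.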

 

Figure 5: connectors.xml

Create a New Authority Connection for the Windows File Share

  • Click “List Authority Connections”
  • Click “Add a new connection”
  • Name = Active Directory
  • Description = optional
  • Click “Type” tab
  • Connection type = Active Directory
  • Click “Continue” button
  • Click “Domain Controller” tab
  • Domain controller name = your-domain-controller-name
  • Domain suffix = your-domain-name
  • Administrative user name = user-account-with-adequate-permissions
  • Administrative password = password
  • Click “Add to End” button
  • Click “Save” button

Set up File Share Repository Connection

  • Click “List Repository Connections”
  • Click “Add new connection”
  • Name = file-share-name
  • Description = optional
  • Click “Type” tab
  • Connection type = Windows shares
  • Authority = Active Directory (this is the name selected when creating the authority)
  • Click “Continue” button
  • Click “Server” tab
  • Server = server-name
  • Authentication domain (optional) = domain-name
  • User name = user-account-with-adequate-permissions
  • Password = account-password
  • Use SIDS for security = Yes (check)
  • Click “Save” button

Set up SharePoint

Set up SharePoint Repository Connection

  • Click “List Repository Connections”
  • Click “Add new connection”
  • Name = SharePoint
  • Description = SharePoint
  • Click “Type” tab
  • Connection type = SharePoint
  • Authority = Windows File Share Permissions
  • Click “Continue” button
  • Click “Server” tab
  • Server SharePoint version = SharePoint Services 4.0 (2010)
  • Server Protocol = https
  • Server Name = your-server-name (e.g., intranet.my-domain.com)
  • Server Port = your-sharepoint-port (e.g., 4443)
  • Site path = path-to-site (e.g., “/sites/my-main-site”)
  • User name = account-with-read-permissions
  • Password = account-password
  • Click “Browse” button
  • Navigate to and select your certificate file (e.g., my-domain.com.cer)
  • Click “Add” button
  • Click “Save” button

Set up File Share Crawl Job

  • Confirm that Connection Status = Connection working
  • Click “List all Jobs”
  • Click “Add a new job”
  • Name = Crawl FileShare
  • Click “Connection” tab
  • Output connection = Solr
  • Repository connection = FileShare
  • Start method = Don’t…
  • Click “Continue” button
  • Click “Scheduling” tab
  • Schedule type = Scan … once
  • Recrawl interval (if continuous) = <blank>
  • Reseed interval (if continuous) = <blank>
  • Click “Paths” tab
  • Select name-of-share (e.g., cognivashare)
    • See Figure 6
  • Click “Add” button
  • Set Filters: See Figure 7.
    • Set 1. Include directory(s) matching *
    • Set 2. Include indexable file(s) matching *
    • Set 3. Exclude un-indexable file(s) matching *
  • Click “Security” tab
  • File security = Enabled
  • Share security = Disabled
  • Recommended: Click “Content Length” tab
    • Maximum document length = 10240000
  • Click “Save” button

Figure 6: Select Share

 

Figure 7: Path Filters

Run the File Share Crawl Job

  • Click “Status and Job Management”
  • Click “Start”
  • Confirm that the numbers under Documents, Active, and Processed are non-zero and increasing.
  • To view the processing in more detail, click “Result Histogram”
  • Connection = FileShare
  • Click Continue button.
  • Confirm that there is a list of file reading activities.