<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cogniva Research Blog &#187; Matthew Rutledge-Taylor</title>
	<atom:link href="http://blog.cognivaresearch.org/?author=12&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.cognivaresearch.org</link>
	<description>Blog on information science &#38; more.</description>
	<lastBuildDate>Fri, 08 Nov 2013 19:29:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How to Set up Solr and ManifoldCF on an Ubuntu Based Computer</title>
		<link>http://blog.cognivaresearch.org/?p=65</link>
		<comments>http://blog.cognivaresearch.org/?p=65#comments</comments>
		<pubDate>Thu, 26 Sep 2013 05:00:53 +0000</pubDate>
		<dc:creator>Matthew Rutledge-Taylor</dc:creator>
				<category><![CDATA[Automated Taxonomy Discovery]]></category>
		<category><![CDATA[Automated Taxonomy]]></category>
		<category><![CDATA[CISRI]]></category>
		<category><![CDATA[Cogniva]]></category>
		<category><![CDATA[IRAP]]></category>
		<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://blog.cognivaresearch.org/?p=65</guid>
		<description><![CDATA[How to Set up Solr and ManifoldCF on an Ubuntu Based Computer This blog post is intended to provide some guidance on how to set up a computer to run Apache Solr (http://lucene.apache.org/solr/) and Apache ManifoldCF (http://manifoldcf.apache.org/).  Solr is a wrapper &#8230; <a href="http://blog.cognivaresearch.org/?p=65">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<h1>How to Set up Solr and ManifoldCF on an Ubuntu Based Computer</h1>
<p>This blog post provides some guidance on how to set up a computer to run Apache Solr (<a href="http://lucene.apache.org/solr/">http://lucene.apache.org/solr/</a>) and Apache ManifoldCF (<a href="http://manifoldcf.apache.org/">http://manifoldcf.apache.org/</a>).  Solr is a search server built on Lucene.  It provides a web UI and a variety of features such as document text extraction (via Apache Tika).  ManifoldCF is a framework for scheduling crawl jobs and providing repository connectors.  We have used it to import documents from both a Windows (CIFS) file share and MS SharePoint 2010 into Solr.</p>
<p>This guide was written while installing and configuring Solr and ManifoldCF on a VirtualBox virtual machine running Linux Mint 15 (MATE) x64 (<a href="http://www.linuxmint.com/">http://www.linuxmint.com/</a>).  I chose Linux Mint because it is a &#8220;hot&#8221; GNU/Linux distribution these days (<a href="http://distrowatch.com/dwres.php?resource=major">http://distrowatch.com/dwres.php?resource=major</a>).  These instructions can also be used to install and configure Solr and ManifoldCF on Ubuntu.  Just be aware that the standard text editor on MATE is pluma, while on GNOME it is gedit.  So, anywhere you see &#8216;pluma&#8217; below, substitute &#8216;gedit&#8217; on Ubuntu.  These instructions should also work on Debian (again substituting &#8216;gedit&#8217; for &#8216;pluma&#8217;), but I have not verified this.</p>
<p>The development of this guide was a joint effort of Chris Salter and myself.</p>
<h1>Install Solr</h1>
<ul>
<li>Download Solr 4.3.1
<ul>
<li>Solr 4.4 has compatibility issues with the latest version of Mahout.</li>
<li><a href="http://archive.apache.org/dist/lucene/solr/4.3.1/solr-4.3.1.tgz">http://archive.apache.org/dist/lucene/solr/4.3.1/solr-4.3.1.tgz</a></li>
</ul>
</li>
<li>Decompress and move the solr-4.3.1 directory to /usr/share/solr</li>
<li>To do this via the terminal:
<ul>
<li>cd ~/Downloads</li>
<li>wget http://archive.apache.org/dist/lucene/solr/4.3.1/solr-4.3.1.tgz</li>
<li>tar -xzvf solr-4.3.1.tgz</li>
<li>sudo cp -R solr-4.3.1 /usr/share/solr</li>
</ul>
</li>
<li>Test Solr
<ul>
<li>Open a terminal
<ul>
<li>cd /usr/share/solr/example</li>
<li>sudo java -jar start.jar</li>
</ul>
</li>
<li>Open another terminal
<ul>
<li>cd /usr/share/solr/example/exampledocs</li>
<li>java -jar post.jar *.xml</li>
</ul>
</li>
<li>Confirm that you see something like Figure 1.</li>
</ul>
</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/solr_terminal.png" alt="Solr Test" /><br />
Figure 1: Solr test</p>
<ul>
<li>Open a browser
<ul>
<li><a href="http://localhost:8983/solr/">http://localhost:8983/solr/</a></li>
<li>Confirm that you see something like Figure 2.</li>
</ul>
</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/Solr.png" alt="Solr WebUI" /><br />
Figure 2: Solr WebUI</p>
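<p>The download, unpack, and copy steps above can be collected into a small shell script.  This is only a sketch under the same assumptions as the guide (Solr 4.3.1, a ~/Downloads directory, and /usr/share/solr as the destination).  The function and variable names are my own and are not part of Solr.</p>

```shell
#!/bin/sh
# Sketch of the Solr install steps from this guide.
# SOLR_VERSION and SOLR_DEST mirror the values used above; the
# function names are hypothetical helpers, not part of Solr.
SOLR_VERSION="4.3.1"
SOLR_DEST="/usr/share/solr"

# Print the Apache archive URL for the given Solr version.
solr_archive_url() {
    printf 'http://archive.apache.org/dist/lucene/solr/%s/solr-%s.tgz\n' "$1" "$1"
}

# Download, unpack, and copy Solr into place, stopping at the first failure.
install_solr() {
    # cp -R would copy *into* an existing directory, so refuse to continue.
    if [ -e "$SOLR_DEST" ]; then
        echo "$SOLR_DEST already exists; not overwriting"
        return 1
    fi
    cd "$HOME/Downloads" || return 1
    wget "$(solr_archive_url "$SOLR_VERSION")" || return 1
    tar -xzf "solr-$SOLR_VERSION.tgz" || return 1
    sudo cp -R "solr-$SOLR_VERSION" "$SOLR_DEST"
}
```

<p>Nothing runs until you call install_solr yourself, so the file can be sourced first and inspected before anything is downloaded.</p>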
<h1>Install ManifoldCF</h1>
<ul>
<li>Download ManifoldCF 1.3
<ul>
<li><a href="http://apache.mirror.rafal.ca/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz">http://apache.mirror.rafal.ca/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz</a></li>
</ul>
</li>
<li>Decompress and move the apache-manifoldcf-1.3 directory to /usr/share/manifoldcf</li>
<li>To do this via the terminal:
<ul>
<li>cd ~/Downloads</li>
<li>wget http://apache.mirror.rafal.ca/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz</li>
<li>tar -xzvf apache-manifoldcf-1.3-bin.tar.gz</li>
<li>sudo cp -R apache-manifoldcf-1.3 /usr/share/manifoldcf</li>
</ul>
</li>
<li>Test
<ul>
<li>Open a terminal
<ul>
<li>cd /usr/share/manifoldcf/example</li>
<li>sudo java -jar start.jar</li>
</ul>
</li>
<li>Open a browser
<ul>
<li><a href="http://localhost:8345/mcf-crawler-ui">http://localhost:8345/mcf-crawler-ui</a></li>
<li>Login
<ul>
<li>User ID: admin</li>
<li>Password: admin</li>
</ul>
</li>
</ul>
</li>
<li>Confirm that you see something like Figure 3.</li>
</ul>
</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/manifoldcf.png" alt="ManifoldCF WebUI" /><br />
Figure 3: ManifoldCF WebUI</p>
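<p>If you prefer to confirm from a terminal that both web UIs are answering, something like the following may help.  It is a minimal sketch that assumes curl is installed (the guide itself does not use it) and that the two servers listen on the ports shown above.  The helper names are hypothetical.</p>

```shell
#!/bin/sh
# Sketch: check that the Solr and ManifoldCF web UIs respond.
# Assumes curl is available; helper names are illustrative only.

# Succeed if the HTTP status code is 2xx or 3xx (a login redirect counts).
is_ok_status() {
    case "$1" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the HTTP status code for a URL; curl prints 000 if nothing answers.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Report whether a single UI is reachable.
check_ui() {
    code="$(http_status "$1")"
    if is_ok_status "$code"; then
        echo "OK   $1 ($code)"
    else
        echo "FAIL $1 ($code)"
        return 1
    fi
}
```

<p>With both servers running, call check_ui http://localhost:8983/solr/ and then check_ui http://localhost:8345/mcf-crawler-ui/ to see one line of output per UI.</p>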
<h1>Configure Solr</h1>
<ul>
<li>Recommended: Ignore Tika Errors
<ul>
<li>Edit /usr/share/solr/example/solr/collection1/conf/solrconfig.xml</li>
<li>Add the line &lt;bool name="ignoreTikaException"&gt;true&lt;/bool&gt; to the list &lt;lst name="defaults"&gt; under &lt;requestHandler name="/update/extract" … &gt;</li>
<li>After the changes the relevant part of the file should look like figure 4.  The text highlighting was added for clarity.</li>
</ul>
</li>
</ul>
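<p>After editing, a quick grep can confirm that the setting made it into the file.  This is a rough sketch.  It assumes the new element sits on a single line, as written in the step above, and it does not verify that the line is inside the /update/extract handler.  The function names are my own.</p>

```shell
#!/bin/sh
# Sketch: confirm that solrconfig.xml sets ignoreTikaException to true.
# The path is the one used in this guide; function names are hypothetical.
SOLRCONFIG="/usr/share/solr/example/solr/collection1/conf/solrconfig.xml"

# Succeed if the given file has a line setting ignoreTikaException to true.
# Assumption: the element is on one line. This does not check which
# request handler the line belongs to.
tika_errors_ignored() {
    grep -q 'name="ignoreTikaException".*true' "$1"
}

# Print a short report for the default config path.
report_tika_setting() {
    if tika_errors_ignored "$SOLRCONFIG"; then
        echo "ignoreTikaException is set to true"
    else
        echo "ignoreTikaException is not set; edit $SOLRCONFIG"
        return 1
    fi
}
```

<p>Run report_tika_setting after saving the file, then restart Solr so the change takes effect.</p>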
<h1>Configure ManifoldCF</h1>
<h2>Connect ManifoldCF to Solr</h2>
<ul>
<li>Start Solr if it is not already running</li>
<li>Open a browser
<ul>
<li><a href="http://localhost:8345/mcf-crawler-ui/">http://localhost:8345/mcf-crawler-ui/</a></li>
</ul>
</li>
<li>Click &#8220;List Output Connections&#8221;</li>
<li>Click &#8220;Add a new output connection&#8221;</li>
<li>Name = Solr</li>
<li>Description = Connect to Solr</li>
<li>Click &#8220;Type&#8221; tab</li>
<li>Connection type = Solr</li>
<li>Click &#8220;Continue&#8221; button</li>
<li>Recommended: Click &#8220;Documents&#8221; tab
<ul>
<li>Maximum document length = 10240000 (i.e., roughly 10 MB)</li>
</ul>
</li>
<li>Click Save</li>
<li>Confirm that Connection Status = Connection working
<ul>
<li>(see Figure 4).</li>
</ul>
</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/OutputConnectionSolr.png" alt="Solr Output Connection" /><br />
Figure 4: Solr Output Connection</p>
<h2>Add Windows File Share Support</h2>
<ul>
<li>Stop ManifoldCF</li>
<li>Download http://jcifs.samba.org/src/jcifs-1.3.17.jar</li>
<li>Move the file jcifs-1.3.17.jar to /usr/share/manifoldcf/connector-lib-proprietary</li>
<li>Edit /usr/share/manifoldcf/connectors.xml</li>
<li>Uncomment &lt;repositoryconnector name="Windows shares"/&gt;
<ul>
<li>See Figure 5.</li>
</ul>
</li>
<li>Save the file</li>
<li>Start ManifoldCF</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/connectorsXmlCIFS.png" alt="connectors.xml" /><br />
Figure 5: connectors.xml</p>
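<p>The jar download and copy can also be done from a terminal, and grep can point you at the line to uncomment.  The sketch below assumes the same paths as the steps above and leaves the actual edit of connectors.xml to you.  The helper names are hypothetical.</p>

```shell
#!/bin/sh
# Sketch: fetch the jCIFS jar and locate the connectors.xml entry.
# Paths and file names are the ones used in this guide; the helper
# names are illustrative and not part of ManifoldCF.
MCF_HOME="/usr/share/manifoldcf"
JCIFS_JAR="jcifs-1.3.17.jar"

# Print where the jar should end up for a given ManifoldCF directory.
jcifs_target() {
    printf '%s/connector-lib-proprietary/%s\n' "$1" "$2"
}

# Download the jar and copy it into place (stop ManifoldCF first).
install_jcifs() {
    cd "$HOME/Downloads" || return 1
    wget "http://jcifs.samba.org/src/$JCIFS_JAR" || return 1
    sudo cp "$JCIFS_JAR" "$(jcifs_target "$MCF_HOME" "$JCIFS_JAR")"
}

# Print, with line numbers, the lines of a file that mention the
# "Windows shares" connector, so the entry is easy to find in an editor.
find_windows_shares_entry() {
    grep -n 'Windows shares' "$1"
}
```

<p>For example, find_windows_shares_entry /usr/share/manifoldcf/connectors.xml prints the line number to jump to in pluma or gedit.</p>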
<h2>Create a New Authority Connection for the Windows File Share</h2>
<ul>
<li>Click &#8220;List Authority Connections&#8221;</li>
<li>Click &#8220;Add a new connection&#8221;</li>
<li>Name = Active Directory</li>
<li>Description = <em>optional</em></li>
<li>Click &#8220;Type&#8221; tab</li>
<li>Connection type = Active Directory</li>
<li>Click &#8220;Continue&#8221; button</li>
<li>Click &#8220;Domain Controller&#8221; tab</li>
<li>Domain controller name = <em>your-domain-controller-name</em></li>
<li>Domain suffix = <em>your-domain-name</em></li>
<li>Administrative user name = <em>user-account-with-adequate-permissions</em></li>
<li>Administrative password = <em>password</em></li>
<li>Click &#8220;Add to End&#8221; button</li>
<li>Click &#8220;Save&#8221; button</li>
</ul>
<h2>Set up File Share Repository Connection</h2>
<ul>
<li>Click &#8220;List Repository Connections&#8221;</li>
<li>Click &#8220;Add new connection&#8221;</li>
<li>Name = <em>file-share-name</em></li>
<li>Description = <em>optional</em></li>
<li>Click &#8220;Type&#8221; tab</li>
<li>Connection type = Windows shares</li>
<li>Authority = Active Directory (this is the name selected when creating the authority)</li>
<li>Click &#8220;Continue&#8221; button</li>
<li>Click &#8220;Server&#8221; tab</li>
<li>Server = <em>server-name</em></li>
<li>Authentication domain (optional) = <em>domain-name</em></li>
<li>User name = <em>user-account-with-adequate-permissions</em></li>
<li>Password = <em>account-password</em></li>
<li>Use SIDS for security = Yes (check)</li>
<li>Click &#8220;Save&#8221; button</li>
</ul>
<h2>Set up SharePoint</h2>
<ul>
<li>Download the SharePoint plugin.
<ul>
<li><a href="http://mirror.its.dal.ca/apache/manifoldcf/apache-manifoldcf-sharepoint-2010-plugin-0.2-bin.zip">http://mirror.its.dal.ca/apache/manifoldcf/apache-manifoldcf-sharepoint-2010-plugin-0.2-bin.zip</a></li>
</ul>
</li>
<li>Unzip the file locally.</li>
<li>Inside the folder, you’ll find README.TXT</li>
<li>Read that file and follow its instructions.  (Sorry, SharePoint administration is beyond the scope of this guide.)</li>
</ul>
<h2>Set up SharePoint Repository Connection</h2>
<ul>
<li>Click &#8220;List Repository Connections&#8221;</li>
<li>Click &#8220;Add new connection&#8221;</li>
<li>Name = SharePoint</li>
<li>Description = SharePoint</li>
<li>Click &#8220;Type&#8221; tab</li>
<li>Connection type = SharePoint</li>
<li>Authority = Windows File Share Permissions</li>
<li>Click &#8220;Continue&#8221; button</li>
<li>Click &#8220;Server&#8221; tab</li>
<li>Server SharePoint version = SharePoint Services 4.0 (2010)</li>
<li>Server Protocol = https</li>
<li>Server Name = <em>your-server-name</em> (e.g., intranet.my-domain.com)</li>
<li>Server Port = <em>your-sharepoint-port</em> (e.g., 4443)</li>
<li>Site path = <em>path-to-site</em> (e.g., &#8220;/sites/my-main-site&#8221;)</li>
<li>User name = <em>account-with-read-permissions</em></li>
<li>Password = <em>account-password</em></li>
<li>Click &#8220;Browse&#8221; button</li>
<li>Navigate to and select your certificate file (e.g., my-domain.com.cer)</li>
<li>Click &#8220;Add&#8221; button</li>
<li>Click &#8220;Save&#8221; button</li>
</ul>
<h2>Set up File Share Crawl Job</h2>
<ul>
<li>Confirm that Connection Status = Connection working</li>
<li>Click &#8220;List all Jobs&#8221;</li>
<li>Click &#8220;Add a new job&#8221;</li>
<li>Name = Crawl FileShare</li>
<li>Click &#8220;Connection&#8221; tab</li>
<li>Output connection = Solr</li>
<li>Repository connection = FileShare</li>
<li>Start method = Don&#8217;t&#8230;</li>
<li>Click &#8220;Continue&#8221; button</li>
<li>Click &#8220;Scheduling&#8221; tab</li>
<li>Schedule type = Scan &#8230; once</li>
<li>Recrawl interval (if continuous) = &lt;blank&gt;</li>
<li>Reseed interval (if continuous) = &lt;blank&gt;</li>
<li>Click &#8220;Paths&#8221; tab</li>
<li>Select name-of-share (e.g., cognivashare)
<ul>
<li>See Figure 6</li>
</ul>
</li>
<li>Click &#8220;Add&#8221; button</li>
<li>Set Filters: See Figure 7.
<ul>
<li>Set 1. Include directory(s) matching *</li>
<li>Set 2. Include indexable file(s) matching *</li>
<li>Set 3. Exclude un-indexable file(s) matching *</li>
</ul>
</li>
<li>Click &#8220;Security&#8221; tab</li>
<li>File security = Enabled</li>
<li>Share security = Disabled</li>
<li>Recommended: Click &#8220;Content Length&#8221; tab
<ul>
<li>Maximum document length = 10240000</li>
</ul>
</li>
<li>Click &#8220;Save&#8221; button</li>
</ul>
<p><img src="http://blog.cognivaresearch.org/images/FileShareJobPaths.png" alt="Select Share" /><br />
Figure 6: Select Share</p>
<p><img src="http://blog.cognivaresearch.org/images/FileShareJobFilters.png" alt="Path Filters" /><br />
Figure 7: Path Filters</p>
<h2>Run the File Share Crawl Job</h2>
<ul>
<li>Click &#8220;Status and Job Management&#8221;</li>
<li>Click “Start”</li>
<li>Confirm that the numbers under Documents, Active, and Processed are non-zero and increasing.</li>
<li>To view the processing in more detail, click “Result Histogram”</li>
<li>Connection = FileShare</li>
<li>Click Continue button.</li>
<li>Confirm that there is a list of file reading activities.</li>
</ul>
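<p>To check from the Solr side that crawled documents are arriving, you can ask Solr for its document count.  The sketch below assumes curl is installed and that the default collection1 core from the Solr example is in use.  It parses the JSON response with sed rather than a real JSON parser, so treat it as a convenience check only.  The helper names are my own.</p>

```shell
#!/bin/sh
# Sketch: ask Solr how many documents are in the index.
# Assumes curl is installed and the example "collection1" core is used.
SOLR_SELECT_URL="http://localhost:8983/solr/collection1/select"

# Read a Solr JSON response on stdin and print its numFound value.
extract_num_found() {
    sed -n 's/.*"numFound" *: *\([0-9][0-9]*\).*/\1/p'
}

# Run a match-all query that returns zero rows and print the hit count.
solr_doc_count() {
    curl -s -G "$SOLR_SELECT_URL" \
        --data-urlencode 'q=*:*' \
        --data-urlencode 'rows=0' \
        --data-urlencode 'wt=json' | extract_num_found
}
```

<p>Calling solr_doc_count a few times while the job runs should show the number growing as ManifoldCF sends documents to Solr.</p>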
]]></content:encoded>
			<wfw:commentRss>http://blog.cognivaresearch.org/?feed=rss2&#038;p=65</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
