Manage Content Collectors

Collectors are the main point of integration between external data sources and BravoSearch; everything that is to be searchable with BravoSearch needs a collector.

The term Collector refers both to the code that goes out to collect information from the remote location and to the scheduled task that performs that action.

Available Collectors

There are currently collectors for the following data sources:

  • DNN
    • DataView - This collector is used to collect native and custom content from DNN, and represents the base collection in a SQL DataView. 
    • FileSystem - This collector is used to collect and parse out information from files that are included in the structure of a DNN site.
    • Module: DMX - This collects information about files from the Bring2Mind Document Exchange module (http://www.bring2mind.net/Document-Exchange).
    • Module: PackFlash - This collects information from PackFlash's Constellation module (https://www.packflash.com/).
  • Drupal 7
    • Drupal Content - This collects content from built-in data types for Drupal 7.
    • Drupal DataView - This collector allows for the collection of custom data sources exposed by Drupal 7.
  • Scraping Hub - This collector allows for the collection of data from ScrapingHub. (https://scrapinghub.com/)
  • Oasis LMS - This allows for the collection of data from the Oasis Learning Management System. (http://oasis-lms.com/)
  • HigherLogic LMS
  • Elsevier
  • DigitalIgnite LMS - This module allows for the collection of data from the Digital Ignite Learning Management System. (http://www.yourmembership.com/products/learning-management-system/)
  • BlueSky LMS - This module allows for the collection of data from the BlueSky Learning Management System. (http://www.blueskyelearn.com/)
  • Most RSS Feeds - BravoSearch has built-in support for the collection of RSS feeds from various data sources built on the RSS specification.
  • Most XML Services / Feeds - BravoSearch has built-in support for the collection of XML data from services (SOAP) and XML data feeds. XML sources without external resources or links are supported out of the box.
  • Most JSON Services / Feeds - BravoSearch has built-in support for JSON services and data feeds. JSON sources completed with a single call are supported out of the box (see the sketch after this list).
  • Private Data Feeds - Have a data source that you don't want exposed to all of the users of the BravoSearch system (unlike those listed above)? That's cool, we support private collectors that are locked down to your instance.
  • Additional Data Feeds - BravoSearch is set up to allow for the quick consumption of data from many different data sources; if you don't see one included in the list above, it's possible for our team of developers to create one.
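
For illustration, here is a minimal sketch of what "completed with a single call" means for a JSON source; the feed URL and field names below are hypothetical, not part of BravoSearch:

    import json
    import urllib.request

    FEED_URL = "https://example.org/articles.json"  # hypothetical single-call feed

    # One GET returns the entire result set; there is no paging cursor or
    # continuation token to follow, so the source works out of the box.
    with urllib.request.urlopen(FEED_URL) as response:
        records = json.load(response)

    for record in records:
        print(record.get("title"), record.get("url"))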

File Type Scraping Support:

The following types are supported for the extraction of content from files.

  • csv - Comma Separated file format
  • doc - Word 2003 - 2010 file format
  • docx - Word 2010-2017 file format
  • pdf - Adobe Acrobat file format
  • pptx - PowerPoint 2010-2017 file format
  • rtf - Rich Text file format
  • txt - Text file format
  • xls - Excel 2003 - 2007 file format
  • xlsx - Excel 2010 - 2017 file format
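
As a minimal sketch, gating scraping on this list could look like the following; the function name is hypothetical and the set simply mirrors the extension list above:

    from pathlib import Path

    # File extensions supported for content extraction, per the list above.
    SCRAPEABLE = {".csv", ".doc", ".docx", ".pdf", ".pptx", ".rtf", ".txt", ".xls", ".xlsx"}

    def can_scrape(path: str) -> bool:
        """Return True when the file's extension is in the supported set."""
        return Path(path).suffix.lower() in SCRAPEABLE

    assert can_scrape("reports/annual.PDF")
    assert not can_scrape("media/logo.png")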

Collector Diagnostics

Status - Indicates the current status of the collector instance. There are currently three states that a collector instance can be in:

  • Enabled - Collectors in this state are configured properly and will execute at the next run time.
  • Disabled - Collectors in this state are configured properly, but are paused.
  • Error - The collector is not configured properly, or the data source was unreachable for the number of attempts specified in the settings.

Friendly Name - The friendly name is an indicator for you, the customer, to help you communicate about and differentiate between instances of the search collector.

Runs Every - Indicates how often and when the collector should run; this is configurable from the settings exposed through the edit menu.

Last Complete - Indicates the last time that the collector ran to completion. Please note that this timestamp will not update until the collector has completed collection.

Last Queue - Indicates the last time that the collection was queued for execution. Generally speaking, the collection is queued first, a collection worker picks up the task as soon as one is available, and once the run is complete the Last Complete timestamp is updated.

Last Error - Indicates the last time that the collector failed due to an error. The most common reason that collectors fail is that the data source was inaccessible.
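
A minimal sketch of how the status and the three timestamps relate over a single run (the type and field names here are hypothetical, not BravoSearch APIs):

    import enum
    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Optional

    class CollectorStatus(enum.Enum):
        ENABLED = "Enabled"    # configured properly; will execute at the next run time
        DISABLED = "Disabled"  # configured properly, but paused
        ERROR = "Error"        # misconfigured, or the source was unreachable too often

    @dataclass
    class CollectorDiagnostics:
        status: CollectorStatus = CollectorStatus.ENABLED
        last_queue: Optional[datetime] = None     # set when a run is queued
        last_complete: Optional[datetime] = None  # set only when a run finishes cleanly
        last_error: Optional[datetime] = None     # set when a run fails

    def record_run(diag: CollectorDiagnostics, run_succeeded: bool) -> None:
        """Queue a run, then stamp the column that matches its outcome."""
        diag.last_queue = datetime.now(timezone.utc)
        # ... a collection worker picks the task up once a slot opens ...
        if run_succeeded:
            diag.last_complete = datetime.now(timezone.utc)
        else:
            diag.last_error = datetime.now(timezone.utc)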

Modifying Collectors

Warning: Modifying the settings of a collector, or adding settings to it, can have detrimental effects on BravoCommand, BravoSearch, and the collection source. Unless you are absolutely certain of what you are doing, it is strongly recommended that you reach out to a BravoSquared representative for help.

Adding a Collector

To add a collector to the collection process, navigate to the heading of the desired collector and click on the plus icon. This will open the Add a Collector dialog, which allows the creation and configuration of the collector.

Add a Collector: Screen 1

The first part of the configuration screen presents a couple of options that are used for the initial configuration of the collector:

Friendly Name - The friendly name is a field that has no further application in the product other than to make the collector easier to reference when discussing it. Many customers will use a pattern combining the object they are collecting and the environment they are collecting it from (ex. Website Articles - Dev).

Select Index - This is the index that will be populated with this data.  In most instances there will only be one option to select from for this process.

Advanced: Maximum Collector Failures - Indicates the number of times that a collector can fail before entering an error state.

Advanced: Max Pages - Indicates the maximum number of pages that should be collected. This is helpful in setting finite windows for infinite data sources (ex. logs). Warning: Setting this value too high for large data sets can cause collection runs to wrap into one another, which could lead to a denial of service against the point where data collection is happening.

Advanced: Start Page - If the source supports paging, this indicates the page at which indexing should start.

Advanced: Page Size - If the source supports page sizes, this can be helpful in limiting the amount of data pulled from the source in a single call. Warning: Setting the page size too large for data sets that contain large objects can cause failures during deserialization and ingestion, which will cause the collector to error out.
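
A minimal sketch of how these paging settings interact during a run; fetch_page is a hypothetical stand-in for whatever call the collector makes against the source:

    def collect(fetch_page, start_page=1, page_size=100, max_pages=50, max_failures=3):
        """Yield items page by page, honoring the advanced settings above."""
        failures = 0
        for page in range(start_page, start_page + max_pages):
            try:
                items = fetch_page(page=page, size=page_size)
            except IOError:
                failures += 1
                if failures >= max_failures:
                    raise RuntimeError("too many failures; collector enters the Error state")
                continue
            if not items:  # the source ran out of data before Max Pages was hit
                return
            yield from items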

Add a Collector: Screen 2

The second screen of the Add a Collector dialog is for setting the schedule of the collector. It should be noted that this only schedules the time that the process should be kicked off; the process still needs an open execution slot to operate. If the queue is empty, the system checks for new work every 5 minutes.

Options for Collection (all collection times are specified in UTC):

Hourly:  Indicates that the execution of the collector should be queued at a particular time every hour.

Daily: Indicates that the execution of the collector should be queued at a particular time every day; all times are measured in UTC.

Weekly: Indicates that the execution of the collector should be queued at a particular time on a particular day every week; all times are measured in UTC.

Warning: Do not set all of your collectors to the same time, and be aware of the amount of time that it takes a collector to complete. Failure to take this into account will cause a catastrophic failure in the application. It should also be noted that scheduling the collection of several services located on the same server at the same time can place heavy load on that server, similar to a denial of service attack.
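
A minimal sketch of staggering start times (all UTC) so collectors hitting the same server never queue simultaneously; the collector names are hypothetical:

    from datetime import time

    # Daily start times, offset so no two runs against the same server overlap.
    schedule = {
        "Website Articles - Dev":  time(1, 0),   # 01:00 UTC
        "Website Articles - Prod": time(2, 30),  # 02:30 UTC, after Dev normally completes
        "Document Library":        time(4, 0),   # 04:00 UTC
    }

    for name, start in sorted(schedule.items(), key=lambda kv: kv[1]):
        print(f"{start.isoformat(timespec='minutes')} UTC  {name}")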

Add a Collector: Screen 3

The next step in the collection process is custom to each of the collectors; the example in this screen shows collection from an API endpoint. Each collector will have its own settings, but it's common to have a URL where the content is located, any special characters or commands that need to be included in the query string, and type information that corresponds to the type from the Indexes screen.

Add a Collector: Screen 4

If the collector that you are working with supports dynamic collection and dynamic mapping, you will be presented with a screen that walks you through mapping out the object. In this notation the left-hand side of the mapping represents the object as it will appear in the index, and the right-hand side represents the incoming object from the feed.
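
A minimal sketch of that left/right notation; the field names and the dotted-path helper below are hypothetical, purely to show the direction of the mapping:

    # Index field (left-hand side)  <-  incoming feed field (right-hand side).
    mapping = {
        "title":       "headline",
        "url":         "links.self",      # dotted path into a nested feed object
        "publishedOn": "meta.published",
    }

    def resolve(path: str, record: dict):
        """Walk a dotted path such as 'links.self' through nested dicts."""
        for key in path.split("."):
            record = record[key]
        return record

    feed_record = {
        "headline": "Hello",
        "links": {"self": "https://example.org/1"},
        "meta": {"published": "2017-01-01"},
    }
    indexed = {left: resolve(right, feed_record) for left, right in mapping.items()}
    # indexed == {"title": "Hello", "url": "https://example.org/1",
    #             "publishedOn": "2017-01-01"}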

Warning: Failure to set the mappings correctly, or to set the base mappings, will result in a failure and may require the index to be deleted and recreated. Proceed with extreme caution.

Modifying A Collector

Modifying a collector is fairly straightforward and follows the same flow as the creation of a collector. To get to the modification menu, select the wrench icon at the end of one of the lines inside the collector and select Edit.

Modify a Collector: Screens 1 - 4

The Modify a Collector dialog walks through the same four screens as the Add a Collector dialog described above: the initial configuration options (Friendly Name, Select Index, and the advanced failure and paging settings), the UTC-based collection schedule, the collector-specific settings, and, where the collector supports dynamic collection and mapping, the mapping screen. All of the details and warnings from the Adding a Collector section apply here as well.

Running A Collector Ad-Hoc

You have the option to run any of the collectors in an ad-hoc manner. To do this, navigate to the wrench menu and select Run Now. This will queue the collection of the specified collector at the first available opportunity, once a processing slot has opened. As the process starts, the Last Queue date will update; once it has run to completion, either the Last Complete or the Last Error column will update.

Other Collector Tasks

Clone

Occasionally it may make sense to clone collection tasks, such as when the same collection needs to run at different times as part of a longer process. To help facilitate this there is a built-in process for cloning collections. To access this option, select the wrench menu, then click Clone.

Disable

From time to time it may make sense to stop collecting data while keeping the already-collected data available. To achieve this, the BravoSearch product allows you to pause the collection of data by selecting the wrench menu and clicking Disable.

Delete

In the event that you need to remove a collector from the system, you can do so by navigating to the wrench menu and selecting Delete.