Apache Tika – detects and extracts metadata and text from more than a thousand file types (for example, PPT, XLS, and PDF). All of these file types can be parsed through a single interface that Apache Tika provides.
Apache Drill – runs SQL queries against semi-structured data stored in NoSQL stores. Its distinguishing feature is independence from the storage schema: data in different stores can be analyzed without first defining its structure (schema-free).
R language – a programming language for statistical data processing and graphics.
Solr – an open-source full-text search platform based on Apache Lucene. Its main features are full-text search, hit highlighting, dynamic clustering, database integration, and handling of documents in rich formats (for example, Word and PDF).
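To make the terms "full-text search" and "highlighting" from the Solr entry more concrete, here is a minimal sketch of the underlying idea: an inverted index that maps each word to the documents containing it, plus a helper that marks matched words. It is a toy illustration in plain Python, not Solr's or Lucene's API; the function names (`tokenize`, `build_index`, `search`, `highlight`) and the sample documents are hypothetical, and real engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, keyed by an arbitrary document id.
DOCS = {
    1: "Apache Tika extracts text from PDF files",
    2: "Solr provides full-text search over PDF and Word documents",
    3: "Hive runs SQL-like queries",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def highlight(text, query, tag="em"):
    """Wrap each whole-word occurrence of a query term in <tag>...</tag>."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)


if __name__ == "__main__":
    index = build_index(DOCS)
    for doc_id in sorted(search(index, "pdf search")):
        print(doc_id, highlight(DOCS[doc_id], "pdf search"))
```

The lookup never scans document text at query time: it only intersects the precomputed sets of document ids, which is the basic reason index-based search scales to large collections.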
Applications: Pentaho BA Server
Processing and data access: Pig, Hive, Sqoop, ETL, Lucene