components/indri/patches/pia.patch
author Rich Burridge <rich.burridge@oracle.com>
Wed, 17 Dec 2014 15:33:37 -0800
changeset 3558 2cec274f17fc
parent 1626 8dee2dfe2525
permissions -rw-r--r--
20222479 Need a method to compare test results against a master in Userland
1626 8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
Add our PIA wrapper to indri sources. This patch does several things:
 - Add pia wrapper sources to indri source tree
 - Add new tokenizer which does not treat '_' as a separator
   - TextTokenizerPIA.l differs from TextTokenizer.l in only a single character:
      -[a-zA-Z0-9']+  { byte_position += tokleng; return ASCII_TOKEN; }
      +[a-zA-Z0-9_']+ { byte_position += tokleng; return ASCII_TOKEN; }
   - plus many symbol renames so that the parsers can coexist (toktext -> piatoktext etc.)
   - TextTokenizerPIA.hpp contains only symbol renames
 - The rest are modifications to make indri build the PIA wrapper
--- indri-5.4/pia_wrapper.cpp	po črc 15 14:30:41 2013
+++ indri-5.4/pia_wrapper.cpp	po črc 15 14:29:09 2013
@@ -0,0 +1,222 @@
+/*
+ * TO compile :
+ *      g++ -o libpia_wrapper.so -shared -fPIC -I../vlad-libs/sparc/usr/include/ -L../vlad-libs/sparc/usr/lib/ -lclucene-core -lnvpair pia_wrapper.cc
+ *
+ */
+
+#include <sys/stat.h>
+#include <strings.h>
+#include <stdio.h>
+#include <libnvpair.h>
+
+#include <iostream>
+#include <string>
+#include <sstream>
+#include <fstream>
+
+#include <vector>
+#include "indri/QueryEnvironment.hpp"
+#include "indri/SnippetBuilder.hpp"
+#include "indri/Repository.hpp"
+
+using namespace std;
+
+using namespace indri::api;
+
+#define MAX_RESULTS 3
+#define PIA_DATABASE "/var/db/piadb"
+#define PIA_DATABASE_STORAGE PIA_DATABASE "/collection/storage"
+
+indri::collection::Repository repository;
+
+std::string
+getFieldText(int documentID, std::string field) {
+	std::string ret_val = "";
+	indri::collection::Repository::index_state repIndexState = repository.indexes();
+	indri::index::Index *thisIndex=(*repIndexState)[0];
+	int fieldID=thisIndex->field(field);
+
+	if (fieldID < 1) {
+		return "";
+	}
+
+	const indri::index::TermList *termList=thisIndex->termList(documentID);
+
+	if (!termList) {
+		return "";
+	}
+
+	indri::utility::greedy_vector< indri::index::FieldExtent > fieldVec=termList->fields();
+	indri::utility::greedy_vector< indri::index::FieldExtent >::iterator fIter=fieldVec.begin();
+	while (fIter!=fieldVec.end()) {
+
+		if ((*fIter).id==fieldID) {
+			int beginTerm=(*fIter).begin;
+			int endTerm=(*fIter).end;
+
+	        	/*
+	 	 	 * note that the text is inclusive of the beginning
+		         * but exclusive of the ending
+		 	 */
+			for (int t=beginTerm; t < endTerm; t++) {
+				int thisTermID=termList->terms()[t];
+		       		ret_val = ret_val + thisIndex->term(thisTermID) + " ";
+			}
+		}
+
+		fIter++;
+	}
+
+	delete termList;
+	termList=NULL;
+	return ret_val;
+}
+
+/*
+ * Returns NULL on failure
+ * nvlist *
+ * search(
+ *  nvlist_t *search_params,
+ *  char **errmsg            // Similar to pia_index()
+ * );
+ */
+nvlist *
+search (nvlist_t *search_params, char **errmsg) {
+
+	char *index_path = PIA_DATABASE;
+	nvlist_t **nvl_list_result;
+	nvlist_t *nvl_return;
+	nvlist_t *nvl_result;
+	nvlist_t *results = NULL;
+
+	if (nvlist_alloc(&results, NV_UNIQUE_NAME, 0) != 0) {
+		*errmsg = strdup("nvlist_alloc failed\n");
+		return NULL;
+	}
+
+	try {
+		std::string query;
+		char *panicstack;
+		(void) nvlist_lookup_string(search_params, "stack", &panicstack);
+
+		QueryEnvironment indriEnvironment;
+		indriEnvironment.addIndex(index_path);
+
+		/* Create Indri query */
+		query = "#combine (" + std::string(panicstack) + ")";
+
+		QueryAnnotation *QAresults=indriEnvironment.runAnnotatedQuery(query.c_str(), MAX_RESULTS);
+
+		std::vector<indri::api::ScoredExtentResult> resultVector=QAresults->getResults();
+
+		int totalNumResults=resultVector.size();
+
+		/* Get Parsed document of the results */
+		std::vector<ParsedDocument*> parsedDocs=indriEnvironment.documents(resultVector);
+
+		int results_to_return = 0;
+		for ( size_t i=0; i < totalNumResults && i < MAX_RESULTS; i++ ) {
+				results_to_return++;
+		}
+
+		/* Open Repository */
+		repository.openRead(index_path);
+
+		nvl_list_result = (nvlist_t **) malloc(results_to_return * sizeof(nvlist_t *));
+
+		for ( size_t i=0; i < results_to_return; i++ ) {
+
+			std::string ret="";
+
+			int thisResultDocID=resultVector[i].document;
+
+			if (nvlist_alloc(&nvl_list_result[i], NV_UNIQUE_NAME, 0) != 0) {
+				*errmsg = strdup("nvlist_alloc failed\n");
+				return NULL;
+			}
+
+			if ((ret = getFieldText(thisResultDocID, "bug")) == "") {
+				*errmsg = strdup("Lookup of bugid failed\n");
+				return NULL;
+			} else if (nvlist_add_string(nvl_list_result[i], "pia-bugid", ret.c_str())) {
+				*errmsg = strdup("nvlist_add bugid failed\n");
+				return NULL;
+			}
+
+			if ((ret = getFieldText(thisResultDocID, "stack")) == "") {
+				*errmsg = strdup("Lookup of stack failed\n");
+				return NULL;
+			} else if (nvlist_add_string(nvl_list_result[i], "pia-stack", ret.c_str())) {
+				*errmsg = strdup("nvlist_add stack failed\n");
+				return NULL;
+			}
+
+			if ((ret = getFieldText(thisResultDocID, "signature")) == "") {
+				*errmsg = strdup("Lookup of signature failed\n");
+				return NULL;
+			} else if (nvlist_add_string(nvl_list_result[i], "pia-signature", ret.c_str())) {
+				*errmsg = strdup("nvlist_add signature failed\n");
+				return NULL;
+			}
+
+			int indri_score = 1000 + (int)(resultVector[i].score*1000);
+			if (nvlist_add_int32(nvl_list_result[i], "pia-score", indri_score)) {
+				*errmsg = strdup("nvlist_add score failed\n");
+				return NULL;
+			}
+		}
+		repository.close();
+
+		nvlist_add_nvlist_array(results, "results", nvl_list_result, results_to_return);
+
+		for (int i=0; i<results_to_return; i++) {
+			nvlist_free(nvl_list_result[i]);
+		}
+
+		return results;
+
+	} catch(...){
+		nvl_list_result = (nvlist_t **) malloc(1 * sizeof(nvlist_t *));
+
+		if (nvlist_alloc(&nvl_result, NV_UNIQUE_NAME, 0) != 0) {
+			*errmsg = strdup("nvlist_alloc failed\n");
+			return NULL;
+		}
+
+		if (nvlist_add_string(nvl_result, "error", "Indri Error")) {
+			*errmsg = strdup("nvlist_add error failed\n");
+			return NULL;
+                }
+
+		nvlist_dup(nvl_result, &nvl_list_result[0], 0);
+		nvlist_free(nvl_result);
+		nvlist_add_nvlist_array(results, "results", nvl_list_result, 1);
+
+		return results;
+        }
+}
+
+extern "C" nvlist*
+pia_search (nvlist_t *search_params, char **errmsg) {
+
+	return search (search_params, errmsg);
+
+}
+
+int
+init () {
+
+	struct stat sb;
+	if (stat(PIA_DATABASE_STORAGE, &sb) != 0) {
+		return 1;
+	}
+
+	return 0;
+}
+
+extern "C" int
+pia_init () {
+
+	return init ();
+
+}
--- indri-5.4/src/TextTokenizerPIA.l	Mon Jul 15 14:38:12 2013
+++ indri-5.4/src/TextTokenizerPIA.l	Mon Jul 15 14:36:55 2013
@@ -0,0 +1,588 @@
+%option noyywrap
+%option never-interactive
+%option prefix="piatok"
+
+%{
+
+/*==========================================================================
+ * Copyright (c) 2004 University of Massachusetts.  All Rights Reserved.
+ *
+ * Use of the Lemur Toolkit for Language Modeling and Information Retrieval
+ * is subject to the terms of the software license set forth in the LICENSE
+ * file included with this software, and also available at
+ * http://www.lemurproject.org/license.html
+ *
+ *==========================================================================
+ */
+
+//
+// TextTokenizerPIA
+//
+// 15 September 2005 -- mwb
+//
+
+#include <string.h>
+#include <ctype.h>
+#include "indri/TextTokenizerPIA.hpp"
+#include "indri/TermExtent.hpp"
+#include "indri/TagEvent.hpp"
+#include "indri/TokenizedDocument.hpp"
+#include "indri/UnparsedDocument.hpp"
+#include "indri/UTF8Transcoder.hpp"
+#include "indri/AttributeValuePair.hpp"
+
+static long byte_position;
+
+#define ZAP           1
+#define TAG           2
+#define ASCII_TOKEN   3
+#define UTF8_TOKEN    4
+
+%}
+%start COMMENT
+%%
+
+"<!--" { BEGIN(COMMENT); byte_position += piatokleng; return ZAP; }
+<COMMENT>[^-]+ { byte_position += piatokleng; return ZAP; }
+<COMMENT>"-->" { BEGIN(INITIAL); byte_position += piatokleng; return ZAP; }
+<COMMENT>"-" { byte_position += piatokleng; return ZAP; }
+"<!"[^\>]*">" { byte_position += piatokleng; return ZAP; }
+\<[a-zA-Z/][^\>]*\>                                             { byte_position += piatokleng; return TAG; }
+[&]([a-zA-Z]+|[#]([0-9]+|[xX][a-fA-F0-9]+))[;]         { byte_position += piatokleng; return ZAP; /* symbols */ }
+[A-Z0-9]"."([A-Z0-9]".")*                                        { byte_position += piatokleng; return ASCII_TOKEN; }
+[a-zA-Z0-9_']+                                        { byte_position += piatokleng; return ASCII_TOKEN; }
+"-"[0-9]+("."[0-9]+)?                                  { byte_position += piatokleng; return ASCII_TOKEN; }
+[a-zA-Z0-9\x80-\xFD]+                               { byte_position += piatokleng; return UTF8_TOKEN; }
+
+[\n]                                                   { byte_position += piatokleng; return ZAP; }
+.                                                      { byte_position += piatokleng; return ZAP; }
+
+%%
+
+indri::parse::TokenizedDocument* indri::parse::TextTokenizerPIA::tokenize( indri::parse::UnparsedDocument* document ) {
+
+  _termBuffer.clear();
+  if ( _tokenize_entire_words)
+    _termBuffer.grow( document->textLength * 4);
+  else
+    _termBuffer.grow( document->textLength * 8 ); // extra null per char.
+
+  _document.terms.clear();
+  _document.tags.clear();
+  _document.positions.clear();
+
+  _document.metadata = document->metadata;
+  _document.text = document->text;
+  _document.textLength = document->textLength;
+  _document.content = document->content;
+  _document.contentLength = document->contentLength;
+
+  // byte offset
+  byte_position = document->content - document->text;
+
+  piatok_scan_bytes( document->content, document->contentLength );
+
+  // Main Tokenizer loop
+
+  int type;
+
+  while ( ( type = piatoklex() ) != 0 ) {
+
+    switch ( type ) {
+
+    case ASCII_TOKEN: processASCIIToken(); break;
+
+    case UTF8_TOKEN: processUTF8Token(); break;
+
+    case TAG: if ( _tokenize_markup ) processTag(); break;
+
+    default:
+    case ZAP:
+      break;
+
+    }
+
+  }
+
+  piatok_delete_buffer( YY_CURRENT_BUFFER );
+
+  return &_document;
+}
+
+// Member functions for processing tokenization events as dispatched
+// from the main tokenizer loop
+
+void indri::parse::TextTokenizerPIA::processTag() {
+
+  // Here, we parse the tag in a fashion that is relatively robust to
+  // malformed markup.  piatoktext matches this pattern: <[^>]+>
+
+  if ( piatoktext[1] == '?' || piatoktext[1] == '!' ) {
+
+    // XML declaration like <? ... ?> and <!DOCTYPE ... >
+    return; // ignore
+
+  } else if ( piatoktext[1] == '/' ) { // close tag, e.g. </FOO>
+
+    // Downcase the tag name.
+
+    int len = 0;
+
+    for ( char *c = piatoktext + 2;
+#ifndef WIN32
+          isalnum( *c ) || *c == '-' || *c == '_' || *c == ':' ; c++ ) {
+#else
+          ((*c >= 0) && isalnum( *c )) || *c == '-' || *c == '_' || *c == ':' ; c++ ) {
+#endif
+
+      *c = tolower( *c );
+      if ( *c == ':' ) *c = '_'; /* replace colon (from namespaces) */
+      len++;
+    }
+
+    TagEvent te;
+
+    te.open_tag = false;
+
+    // We need to write len characters, plus a terminating NUL
+    char* write_loc = _termBuffer.write( len + 1 );
+    strncpy( write_loc, piatoktext + 2, len );
+    write_loc[len] = '\0';
+    te.name = write_loc;
+
+    // token position of tag event w/r/t token string
+    te.pos = _document.terms.size();
+
+    te.begin = byte_position - piatokleng;
+    te.end = byte_position;
+
+    _document.tags.push_back( te );
+
+#ifndef WIN32
+    } else if ( isalpha( piatoktext[1] ) ) {
+#else
+    } else if ( (piatoktext[1]  >= 0) && (isalpha( piatoktext[1] ) )) {
+#endif
+
+    // Try to extract the tag name:
+
+    char* c = piatoktext + 1;
+    int i = 0;
+    int offset = 1; // current offset w/r/t byte_position - piatokleng
+    // it starts at one because it is incremented when c is, and c starts at one.
+    char* write_loc;
+
+#ifndef WIN32
+    while ( isalnum( c[i] ) || c[i] == '-' || c[i] == '_' || c[i] == ':' ) i++;
+#else
+    while ( ( (c[i] >= 0) && isalnum( c[i] )) || c[i] == '-' || c[i] == '_' || c[i] == ':' ) i++;
+#endif
+    if ( c[i] == '>' ) {
+
+      // open tag with no attributes, e.g. <title>
+
+      // Ensure tag name is downcased
+      for ( int j = 0; j < i; j++ ) {
+        c[j] = tolower( c[j] );
+        if ( c[j] == ':' ) c[j] = '_'; /* replace colon (from namespaces) */
+      }
+
+      TagEvent te;
+
+      te.open_tag = true;
+
+      // need to write i characters, plus a terminating NUL
+      char* write_loc = _termBuffer.write( i + 1 );
+      strncpy( write_loc, c, i );
+      write_loc[i] = '\0';
+      te.name = write_loc;
+
+      te.pos = _document.terms.size();
+
+      te.begin = byte_position - piatokleng;
+      te.end = byte_position;
+
+      _document.tags.push_back( te );
+
+#ifndef WIN32
+    } else if ( isspace( c[i] ) ) {
+#else
+    } else if ( (c[i]  >= 0) && (isspace( c[i] ) )) {
+#endif
+
+      // open tag with attributes, e.g. <A HREF="www.foo.com/bar">
+
+      TagEvent te;
+
+      te.open_tag = true;
+
+      // Ensure tag name is downcased
+      for ( int j = 0; j < i; j++ ) {
+        c[j] = tolower( c[j] );
+        if ( c[j] == ':' ) c[j] = '_'; /* replace colon (from namespaces) */
+      }
+
+      // need to write i characters, plus a terminating NUL
+      char* write_loc = _termBuffer.write( i + 1 );
+      strncpy( write_loc, c, i );
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   467
+      write_loc[i] = '\0';
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   468
+      te.name = write_loc;
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   469
+      c += i;
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   470
+      offset += i;
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   471
+
+#ifndef WIN32
+      while ( isspace( *c ) ) { c++; offset++; }
+#else
+      while ( (*c >= 0) && isspace( *c ) ) { c++; offset++; }
+#endif
+
+      te.pos = _document.terms.size();
+
+      te.begin = byte_position - piatokleng;
+      te.end = byte_position;
+
+      // Now search for attributes:
+
+      while ( *c != '>' && *c != '\0' ) {
+
+        AttributeValuePair avp;
+
+        // Try to extract attribute name:
+
+        i = 0;
+#ifndef WIN32
+        while ( isalnum( c[i] ) || c[i] == '-' || c[i] == '_' ) i++;
+#else
+        while ( ( (c[i] >= 0) && isalnum( c[i] ) ) || c[i] == '-' || c[i] == '_' ) i++;
+#endif
+
+        if ( i == 0 ) break;
+
+        // Ensure attribute name is downcased
+        for ( int j = 0; j < i; j++ )
+          c[j] = tolower( c[j] );
+
+        // need to write i characters, plus a terminating NUL
+        write_loc = _termBuffer.write( i + 1 );
+        strncpy( write_loc, c, i );
+        write_loc[i] = '\0';
+        avp.attribute = write_loc;
+        c += i;
+        offset += i;
+
+        // attributes can be foo\s*=\s*"bar[">] or foo\s*=\s*bar
+
+        // ignore any spaces
+#ifndef WIN32
+        while ( isspace( *c ) ) { c++; offset++; }
+#else
+        while ( (*c >= 0) && isspace( *c ) ) { c++; offset++; }
+#endif
+
+        if ( *c == '=' ) {
+
+          c++; // get past the '=' sign.
+          offset++;
+
+#ifndef WIN32
+          while ( isspace( *c ) ) { c++; offset++; }
+#else
+          while ( (*c >= 0) && isspace( *c ) ) { c++; offset++; }
+#endif
+
+          if ( *c == '>' ) {
+
+            // common malformed markup <a href=>
+
+            // Insert empty attribute value
+            // need to write a single NUL
+            write_loc = _termBuffer.write( 1 );
+            write_loc[0] = '\0';
+            avp.value = write_loc;
+            avp.begin = byte_position - piatokleng + offset;
+            avp.end = byte_position - piatokleng + offset;
+
+          } else {
+
+            bool quoted = true;
+            char quote_char = '\0';
+            if ( *c == '"' || *c == '\'' ) { quote_char = *c; c++; offset++; }
+            else quoted = false;
+
+            // Attribute value starts here.
+
+            i = 0;
+            // make sure the opening and closing quote characters match...
+            if ( quoted )
+//              while ( c[i] != '"' && c[i] != '>' && c[i] !='\'') i++;
+              while ( c[i] != quote_char && c[i] != '>' ) i++;
+            else
+#ifndef WIN32
+              while ( ! isspace( c[i] ) && c[i] != '>' ) i++;
+#else
+              while ( ((c[i] >= 0) && ! isspace( c[i] ) ) && c[i] != '>' ) i++;
+#endif
+
+            // need to write i characters, plus a terminating NUL
+            write_loc = _termBuffer.write( i + 1 );
+            strncpy( write_loc, c, i );
+            write_loc[i] = '\0';
+            avp.value = write_loc;
+            avp.begin = byte_position - piatokleng + offset;
+            avp.end = byte_position - piatokleng + offset + i;
+            c += i;
+            offset += i;
+
+          }
+        } else {
+
+          // Insert empty attribute value
+          // need to write a single NUL
+          write_loc = _termBuffer.write( 1 );
+          write_loc[0] = '\0';
+          avp.value = write_loc;
+          avp.begin = byte_position - piatokleng + offset;
+          avp.end = byte_position - piatokleng + offset;
+        }
+#ifndef WIN32
+        while ( isspace( *c ) || *c == '"' ) { c++; offset++; }
+#else
+        while ( ((*c >= 0) && isspace( *c )) || *c == '"' ) { c++; offset++; }
+#endif
+
+        te.attributes.push_back( avp );
+      }
+
+      _document.tags.push_back( te );
+
+    }
+
+    // One of the cases that is ignored is this common malformed
+    // markup <foo=bar> with no tag name.  Another is the case
+    // of an email address <[email protected]>
+
+
+  }
+}
+
+void indri::parse::TextTokenizerPIA::processUTF8Token() {
+
+  // A UTF-8 token, as recognized by flex, could actually be
+  // a mixed ASCII/UTF-8 string containing any number of
+  // UTF-8 characters, so we re-tokenize it here.
+
+  indri::utility::HashTable<UINT64,const int>& unicode = _transcoder.unicode();
+
+  int len = strlen( piatoktext );
+
+  UINT64* unicode_chars = new UINT64[len + 1];
+  int* offsets = new int[len + 1];
+  int* lengths = new int[len + 1];
+  _transcoder.utf8_decode( piatoktext, &unicode_chars, NULL, NULL,
+                           &offsets, &lengths );
+
+  const int* p;
+  int cls;             // Character class of current UTF-8 character
+  // offset of current UTF-8 character relative to piatoktext is stored in offsets[i]
+  // byte length of current UTF-8 character is stored in lengths[i]
+
+  int offset = 0;      // Position of start of current *token* (not character) relative to piatoktext
+  int extent = 0;      // Extent for this *token* including trailing punctuation
+  int piatoken_len = 0;   // Same as above, minus the trailing punctuation
+
+  char buf[64];
+
+  // If this flag is true, we have punctuation symbols at the end of a
+  // token, so do not attach another letter to this token.
+  bool no_letter = false;
+
+  // In case there are malformed characters preceding the good
+  // characters:
+  offset = offsets[0];
+
+  for ( int i = 0; unicode_chars[i] != 0; i++ ) {
+
+    p = unicode.find( unicode_chars[i] );
+    cls = p ? *p : 0;
+
+    if ( ! _tokenize_entire_words ) { // Tokenize by character
+
+      if ( cls != 0 && cls != 3 && cls != 5 && cls != 9 ) {
+
+        writeToken( piatoktext + offsets[i], lengths[i],
+                    byte_position - piatokleng + offsets[i],
+                    byte_position - piatokleng + offsets[i] + lengths[i] );
+      }
+      continue;
+    }
+
+    // If this is not the first time through this loop, we need
+    // to check whether any bytes in piatoktext were skipped
+    // during the UTF-8 analysis:
+
+    if ( i != 0 && offset + piatoken_len != offsets[i] ) {
+
+      // Write out the token we are working on, if any:
+
+      if ( piatoken_len > 0 ) {
+
+        writeToken( piatoktext + offset, piatoken_len,
+                    byte_position - piatokleng + offset,
+                    byte_position - piatokleng + offset + extent );
+      }
+
+      extent = 0;
+      piatoken_len = 0;
+      no_letter = false;
+      offset = offsets[i];
+    }
+
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   679
+    // Tokenize by word:
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   680
+
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   681
+    switch ( cls ) {
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   682
+
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
diff changeset
   683
+    case 4: // Currency symbol: always extracted alone
8dee2dfe2525 PSARC/2013/232 Indri
Vladimir Marek <Vladimir.Marek@oracle.com>
parents:
+      // Action: write the token we are working on,
+      // and write this symbol as a separate token
+      writeToken( piatoktext + offset, extent,
+                  byte_position - piatokleng + offset,
+                  byte_position - piatokleng + offset + extent );
+
+      offset += extent;
+
+      writeToken( piatoktext + offset, lengths[i],
+                  byte_position - piatokleng + offset,
+                  byte_position - piatokleng + offset + lengths[i] );
+
+      offset += lengths[i];
+      piatoken_len = 0;
+      extent = 0;
+      no_letter = false;
+      break;
+
+    case 1: // Apostrophe
+    case 10: // Decimal separator
+    case 6: // Letter
+    case 7: // Digit
+      // Action: add this character to the end of the token we are
+      // working on
+      if ( no_letter ) { // This is a token boundary
+        writeToken( piatoktext + offset, piatoken_len,
+                    byte_position - piatokleng + offset,
+                    byte_position - piatokleng + offset + extent );
+
+        offset += extent;
+        extent = 0;
+        piatoken_len = 0;
+        no_letter = false;
+
+      }
+
+      extent += lengths[i];
+      piatoken_len += lengths[i];
+      break;
+
+    case 2: // Percent
+    case 8: // Punctuation
+    case 12: // Thousands separator
+    case 11: // Hyphen
+      // Action: These characters are included in the extent of the
+      // token we are working on.
+      no_letter = true;
+      extent += lengths[i];
+      break;
+
+    case 0: // No character class!
+    case 3: // Control character
+    case 5: // Non-punctuation symbol
+    case 9: // Whitespace
+    default:
+      // Action: write the token we are working on.  Do not include
+      // this character in any future token.
+      writeToken( piatoktext + offset, piatoken_len,
+                  byte_position - piatokleng + offset,
+                  byte_position - piatokleng + offset + extent );
+
+      offset += (extent + lengths[i]); // Include current character
+      extent = 0;
+      piatoken_len = 0;
+      no_letter = false;
+
+      break;
+    }
+  }
+
+  // Write out last token
+  if ( piatoken_len > 0 )
+    writeToken( piatoktext + offset, piatoken_len,
+                byte_position - piatokleng + offset,
+                byte_position - piatokleng + offset + extent );
+
+  delete[] unicode_chars;
+  delete[] offsets;
+  delete[] lengths;
+}
+
+void indri::parse::TextTokenizerPIA::processASCIIToken() {
+
+  int piatoken_len = strlen( piatoktext );
+
+  // piatoken_len here is the length of the token without
+  // any trailing punctuation.
+
+  for ( int i = piatoken_len - 1; i > 0; i-- ) {
+
+    if ( ! ispunct( piatoktext[i] ) )
+      break;
+    else
+      piatoken_len--;
+  }
+
+  if ( _tokenize_entire_words ) {
+
+    writeToken( piatoktext, piatoken_len, byte_position - piatokleng, byte_position );
+
+  } else {
+
+    for ( int i = 0; i < piatoken_len; i++ )
+      writeToken( piatoktext + i, 1, byte_position - piatokleng + i,
+                  byte_position - piatokleng + i + 1 );
+  }
+}
+
+
+// ObjectHandler implementation
+
+void indri::parse::TextTokenizerPIA::handle( indri::parse::UnparsedDocument* document ) {
+
+  _handler->handle( tokenize( document ) );
+}
+
+void indri::parse::TextTokenizerPIA::setHandler( ObjectHandler<indri::parse::TokenizedDocument>& h ) {
+
+  _handler = &h;
+}
+
+void indri::parse::TextTokenizerPIA::writeToken( char* token, int piatoken_len,
+                                              int extent_begin, int extent_end ) {
+
+
+  // The TermExtent for a token will include trailing punctuation.
+  // The purpose for this is that it makes for a nicer display when a
+  // sequence of tokens (say, a sentence) is retrieved and shown to
+  // the user.
+
+  TermExtent extent;
+  extent.begin = extent_begin;
+  extent.end = extent_end;
+  _document.positions.push_back( extent );
+
+  // The terms entry for a token won't include the punctuation.
+
+  char* write_loc = _termBuffer.write( piatoken_len + 1 );
+  strncpy( write_loc, token, piatoken_len );
+  write_loc[piatoken_len] = '\0';
+  _document.terms.push_back( write_loc );
+}
+
+
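Note on the term/extent split in processASCIIToken() and writeToken() above: the stored term drops trailing punctuation, while the byte extent still covers it. A minimal standalone sketch of that idea follows. It is not part of the patch; `SketchToken` and `makeToken` are hypothetical names, and `byteEnd` stands in for `byte_position`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for one indexed token: the stored term plus the
// byte range [begin, end) it covers in the source text.
struct SketchToken {
  std::string term;
  std::size_t begin;
  std::size_t end;
};

// Trim trailing punctuation from the term but keep it inside the extent.
// The first character is never trimmed, matching the "i > 0" loop bound
// in processASCIIToken(). Assumes byteEnd >= text.size().
SketchToken makeToken( const std::string& text, std::size_t byteEnd ) {
  std::size_t len = text.size();
  while ( len > 1 && std::ispunct( static_cast<unsigned char>( text[len - 1] ) ) )
    --len;
  return SketchToken{ text.substr( 0, len ), byteEnd - text.size(), byteEnd };
}
```

For example, `makeToken( "can't.", 10 )` would store the term `can't` with extent [4, 10), so the full stop stays available for display but is not indexed.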
--- indri-5.4/include/indri/TextTokenizerPIA.hpp	po črc 15 14:38:50 2013
+++ indri-5.4/include/indri/TextTokenizerPIA.hpp	po črc 15 14:36:54 2013
@@ -0,0 +1,73 @@
+/*==========================================================================
+ * Copyright (c) 2003-2005 University of Massachusetts.  All Rights Reserved.
+ *
+ * Use of the Lemur Toolkit for Language Modeling and Information Retrieval
+ * is subject to the terms of the software license set forth in the LICENSE
+ * file included with this software, and also available at
+ * http://www.lemurproject.org/license.html
+ *
+ *==========================================================================
+ */
+
+//
+// TextTokenizerPIA
+//
+// 15 September 2005 -- mwb
+//
+
+#ifndef INDRI_TEXTTOKENIZERPIA_HPP
+#define INDRI_TEXTTOKENIZERPIA_HPP
+
+#include <stdio.h>
+#include <string>
+#include <map>
+
+#include "indri/IndriTokenizer.hpp"
+#include "indri/Buffer.hpp"
+#include "indri/TagEvent.hpp"
+#include "indri/UnparsedDocument.hpp"
+#include "indri/TokenizedDocument.hpp"
+#include "indri/UTF8Transcoder.hpp"
+
+namespace indri {
+  namespace parse {
+
+    class TextTokenizerPIA : public Tokenizer {
+
+    public:
+      TextTokenizerPIA( bool tokenize_markup = true, bool tokenize_entire_words = true ) : _handler(0) {
+
+        _tokenize_markup = tokenize_markup;
+        _tokenize_entire_words = tokenize_entire_words;
+      }
+
+      ~TextTokenizerPIA() {}
+
+      TokenizedDocument* tokenize( UnparsedDocument* document );
+
+      void handle( UnparsedDocument* document );
+      void setHandler( ObjectHandler<TokenizedDocument>& h );
+
+    protected:
+      void processASCIIToken();
+      void processUTF8Token();
+      void processTag();
+
+      indri::utility::Buffer _termBuffer;
+      UTF8Transcoder _transcoder;
+
+      bool _tokenize_markup;
+      bool _tokenize_entire_words;
+
+    private:
+      ObjectHandler<TokenizedDocument>* _handler;
+      TokenizedDocument _document;
+
+      void writeToken( char* token, int token_len, int extent_begin,
+                       int extent_end );
+    };
+  }
+}
+
+#endif // INDRI_TEXTTOKENIZERPIA_HPP
+
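Note on the TokenizerFactory.cpp changes in this patch: the factory maps a requested name such as "pia" or "pia-nomarkup" to "PIA" or "PIA without Markup", ignoring case. A minimal standalone sketch of that matching follows, assuming prefix comparison in place of the per-character checks used in the patch. It is not part of the patch; `startsWithNoCase` and `sketchPreferredName` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true when `name` starts with `prefix`, ignoring
// ASCII case. `prefix` must already be lowercase.
static bool startsWithNoCase( const std::string& name, const std::string& prefix ) {
  if ( name.size() < prefix.size() )
    return false;
  for ( std::size_t i = 0; i < prefix.size(); i++ ) {
    if ( std::tolower( static_cast<unsigned char>( name[i] ) ) != prefix[i] )
      return false;
  }
  return true;
}

// Hypothetical stand-in for the factory's name normalisation; returns ""
// for names this sketch does not recognise.
std::string sketchPreferredName( const std::string& name ) {
  if ( startsWithNoCase( name, "pia-no" ) )
    return "PIA without Markup";
  if ( startsWithNoCase( name, "pia" ) )
    return "PIA";
  return "";
}
```

The longer prefix is tested first so that "pia-nomarkup" is not swallowed by the plain "pia" case.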
--- indri-5.4/src/TokenizerFactory.cpp	po črc 15 14:39:30 2013
+++ indri-5.4/src/TokenizerFactory.cpp	po črc 15 14:29:11 2013
@@ -22,6 +22,7 @@

 #include "indri/TextTokenizer.hpp"
 // Add an #include for your Tokenizer here.
+#include "indri/TextTokenizerPIA.hpp"


 #define TOKENIZER_WORD ("Word")
@@ -29,6 +30,8 @@
 #define TOKENIZER_CHAR ("Char")
 #define TOKENIZER_CHAR_NO_MARKUP ("Char without Markup")
 // Add a #define for your Tokenizer here.
+#define TOKENIZER_PIA ("PIA")
+#define TOKENIZER_PIA_NO_MARKUP ("PIA without Markup")


 //
@@ -78,8 +81,23 @@
     // got "char"
     return TOKENIZER_CHAR;

+  } else if ( ( name[0] == 'p' || name[0] == 'P' ) &&
+       ( name[1] == 'i' || name[1] == 'I' ) &&
+       ( name[2] == 'a' || name[2] == 'A' ) ) {
+
+    if ( name[3] == '-' &&
+         ( name[4] == 'n' || name[4] == 'N' ) &&
+         ( name[5] == 'o' || name[5] == 'O' ) ) {
+
+      // got "pia-nomarkup"
+      return TOKENIZER_PIA_NO_MARKUP;
+    }
+
+    // got "pia"
+    return TOKENIZER_PIA;
   }

+
   return "";
 }

@@ -105,6 +123,14 @@

     tokenizer = new indri::parse::TextTokenizer( false, false );

+  } else if ( preferred == TOKENIZER_PIA ) {
+
+    tokenizer = new indri::parse::TextTokenizerPIA();
+
+  } else if ( preferred == TOKENIZER_PIA_NO_MARKUP ) {
+
+    tokenizer = new indri::parse::TextTokenizerPIA( false );
+
   } else {

     LEMUR_THROW( LEMUR_RUNTIME_ERROR, name + " is not a known tokenizer." );
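Note on the name matching in the TokenizerFactory hunk above: the
character-by-character test is meant to accept any name that starts with
"pia" in either case, and to select the no-markup variant when the name
continues with "-no". A minimal sketch of the same prefix logic follows. The
helpers starts_with_icase and pia_tokenizer_name are hypothetical names used
only for illustration; they are not part of indri or of this patch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not in indri): case-insensitive prefix test.
// `prefix` is assumed to be lowercase ASCII.
static bool starts_with_icase(const std::string& name, const std::string& prefix) {
  if (name.size() < prefix.size()) return false;
  for (std::size_t i = 0; i < prefix.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(name[i])) != prefix[i]) return false;
  }
  return true;
}

// Hypothetical helper (not in indri): maps a requested name to the strings
// the patch defines as TOKENIZER_PIA and TOKENIZER_PIA_NO_MARKUP,
// or to "" when the name is not a PIA tokenizer name.
static std::string pia_tokenizer_name(const std::string& name) {
  if (!starts_with_icase(name, "pia")) return "";
  if (starts_with_icase(name, "pia-no")) return "PIA without Markup";
  return "PIA";
}
```

Checking the length before indexing avoids reading past the end of short
names, which the index-by-index comparison has to rely on short-circuiting
to do.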
--- indri-5.4/src/FileClassEnvironmentFactory.cpp	Mon Jul 15 14:40:19 2013
+++ indri-5.4/src/FileClassEnvironmentFactory.cpp	Mon Jul 15 14:29:12 2013
@@ -189,6 +189,20 @@
     trec_conflations      // conflations
   },
   {
+    "trecpia",           // name
+    "xml",                // parser
+    "pia",               // tokenizer
+    "tagged",             // iterator
+    "<DOC>",              // startDocTag
+    "</DOC>",             // endDocTag
+    NULL,                 // endMetadataTag
+    trec_include_tags,    // includeTags
+    NULL,                 // excludeTags
+    trec_index_tags,      // indexTags
+    trec_metadata_tags,   // metadataTags
+    trec_conflations      // conflations
+  },
+  {
     "trecchar",           // name
     "xml",                // parser
     "char",               // tokenizer
--- indri-5.4/Makefile.app.in	2013-09-04 06:31:06.740210927 -0700
+++ indri-5.4/Makefile.app.in	2013-09-04 06:27:24.857989779 -0700
@@ -1,22 +1,26 @@
+include MakeDefns
+
 ## your application name here
-APP=
+APP=pia_wrapper
 SRC=$(APP).cpp
 ## extra object files for your app here
 OBJ=
+OUTPUT=lib$(APP).so.1

 prefix = @prefix@
 exec_prefix = ${prefix}
 libdir = @libdir@
 includedir = @includedir@
-INCPATH=-I$(includedir)
-LIBPATH=-L$(libdir)
+INCPATH=-Iinclude -Icontrib/lemur/include
+LIBPATH=-Lobj
 CXXFLAGS=@DEFS@ @CPPFLAGS@ @CXXFLAGS@ $(INCPATH)
-CPPLDFLAGS  = @LDFLAGS@ -lindri @LIBS@
+CPPLDFLAGS  = @LDFLAGS@ -lnvpair -lindri @LIBS@

 all:
-	$(CXX) $(CXXFLAGS) $(SRC) -o $(APP) $(OBJ) $(LIBPATH) $(CPPLDFLAGS)
+	$(CXX) $(CXXFLAGS) $(SRC) -fpic -shared -static-libgcc -h $(OUTPUT) -o $(OUTPUT) $(OBJ) $(LIBPATH) $(CPPLDFLAGS)

 clean:
 	rm -f $(APP)

-
+install:
+	cp $(OUTPUT) $(libdir)
--- indri-5.4/Makefile	2013-09-12 07:39:16.027125829 -0700
+++ indri-5.4/Makefile	2013-09-12 07:38:44.720450641 -0700
@@ -73,5 +73,6 @@
 	$(MAKE) install -C doc
 	$(MAKE) -C site-search install
 	$(INSTALL_DATA) Makefile.app $(pkgdatadir)
+	$(MAKE) -f Makefile.app install

 test: