Getting Started

Quick Start

To start working with Kite, download its binaries here, edit the settings file, and run the jar file kite-console-xx.jar on the cluster machines using a Java 8 JVM (xx refers to the Kite version number). Kite is a distributed system that runs on commodity hardware clusters. The Kite jar file should be executed on each machine separately; there is no need to provide a list of the cluster machines ahead of time. When a new machine runs the Kite jar file, it joins the cluster and is automatically discovered by the other up-and-running machines, as long as they belong to the same network. When a running machine goes down, this is also detected automatically by the other machines. As introduced in About Kite, Kite writes its disk-based artifacts to the Hadoop Distributed File System (HDFS). Therefore, before starting any Kite machines, an up-and-running HDFS instance is required; check here to configure an HDFS cluster. Note that all machines of the same Kite instance should share the same settings for the underlying HDFS, as described in Kite Settings File.

After starting, each Kite machine is ready to receive and execute MQL query language statements. In addition, the Kite jar file can be added as a dependency to Java projects to use the Kite APIs from Java programs or compatible programming languages. To gracefully stop a Kite machine, type quit or exit. The Examples section provides sample MQL statements and queries, as well as a ready-made example of a streaming data source, so you can start using Kite immediately.

Main Features

Using Kite, system administrators can:
  1. Connect Microblogs streams with arbitrary attributes and schemas from local and remote sources.
  2. Create index structures on arbitrary attributes of existing Microblogs streams. Kite provides both spatial and non-spatial index types.
  3. Add and remove machines dynamically to and from the Kite cluster as needed, without restarting or interrupting the cluster operation.
  4. Search existing streams using the MQL query language and Java-compatible APIs. Kite automatically chooses the right index structures to process queries efficiently.
  5. Manage and administrate existing streams and index structures with a variety of utility commands and tools.
Full details of the supported features in Kite are maintained here.

Kite Settings File

When running the Kite jar file on each machine, the system administrator should provide a settings file. By default, Kite assumes a settings file named kite.settings located in the same folder as the jar file. If the settings file is located elsewhere or named differently, its path should be provided as a command line argument to the jar file.

The settings file includes the HDFS settings, which are mandatory, in addition to other optional settings that allow the system administrator to tune and control system performance and behavior. The Kite settings file is a properties file that includes the following parameters:
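Since the settings file is a standard Java properties file, it can be parsed with java.util.Properties. The sketch below is illustrative only: the key names (hdfs.address, index.capacity) are invented placeholders, not Kite's documented parameter names, and in practice the source would be a FileReader over kite.settings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class SettingsExample {
    // Parses properties-format settings text into key/value pairs.
    // Wraps the checked IOException to keep the sketch compact.
    static Properties loadSettings(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical keys, for illustration only.
        Properties settings =
                loadSettings("hdfs.address=localhost:9000\nindex.capacity=2000000\n");
        System.out.println(settings.getProperty("hdfs.address")); // localhost:9000
    }
}
```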

MQL Query Language

Kite comes with a SQL-like query language, called the Microblogs Query Language (MQL), that gives system administrators easy access to the system features through a declarative interface. MQL provides the main statements CREATE STREAM, CREATE INDEX, DROP STREAM, DROP INDEX, and SELECT to create, drop, and query streams and index structures. It also provides additional statements to manage and administrate the system assets: SHOW, UNSHOW, PAUSE, RESUME, ACTIVATE, DEACTIVATE, RESTART, and DESC. The usage of each statement is detailed below.
Syntax:
  CREATE STREAM stream_name (att1:Type, att2:Type, att3:Type, ..., attn:Type)
  FROM stream_source
  FORMAT stream_format
Example:
  CREATE STREAM stream1 (id:Long, mtime:Timestamp, keyword:String, location:GeoLocation, username:String)
  FROM Network_TCP(
  FORMAT CSV(0,1,4,3,2)
This statement creates and connects a new Microblog stream to the system.
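The FORMAT CSV(0,1,4,3,2) clause pairs each schema attribute, in order, with the index of the CSV field it is read from. A minimal sketch of that mapping under this reading (the helper below is illustrative, not Kite code):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvFormatExample {
    // For each schema attribute, picks the CSV field at the given index:
    // attribute i comes from field indices[i], mirroring FORMAT CSV(0,1,4,3,2).
    static List<String> mapFields(String csvLine, int[] indices) {
        String[] fields = csvLine.split(",");
        List<String> values = new ArrayList<>();
        for (int idx : indices) {
            values.add(fields[idx]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Raw field order in the line: id, mtime, username, location, keyword
        String line = "77,2017-01-13,alice,24.7;-122.3,obama";
        // Schema order: id, mtime, keyword, location, username
        System.out.println(mapFields(line, new int[]{0, 1, 4, 3, 2}));
        // [77, 2017-01-13, obama, 24.7;-122.3, alice]
    }
}
```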
Syntax:
  CREATE INDEX HASH index_name ON stream_name(attribute_name) [OPTIONS index_capacity, num_index_segments]
  CREATE INDEX SPATIAL spatial_partitioning_type index_name ON stream_name(attribute_name)
  [OPTIONS index_capacity, num_index_segments, north, south, east, west, num_rows, num_cols]
Examples:
  CREATE INDEX HASH index1 ON stream1(keyword)
  CREATE INDEX HASH index1 ON stream1(keyword) OPTIONS 2000000,20
  CREATE INDEX SPATIAL GRID index2 ON stream1(location)
  CREATE INDEX SPATIAL GRID index2 ON stream1(location) OPTIONS 2000000,20,90,-90,180,-180,180,360
This statement creates a new index on an existing stream. Kite supports two families of index structures: hash indexes for arbitrary attributes and spatial indexes for spatial attributes. Each Kite index consists of two components: an in-memory component and a disk-based component. Both components are segmented based on the time attribute. The in-memory component has a maximum capacity; when that capacity is filled, the oldest data segment is flushed to the disk-based component.
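The segmented in-memory component described above can be illustrated with a toy sketch. This is not Kite's implementation: it is a self-contained hash index whose postings are grouped into segments, and when the in-memory budget is exceeded the oldest segment is evicted, standing in for the flush to the disk-based component.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SegmentedHashIndex {
    // One time segment: key -> list of record ids inserted during that segment.
    static class Segment {
        final Map<String, List<Long>> entries = new HashMap<>();
        int size = 0;
    }

    private final Deque<Segment> segments = new ArrayDeque<>();
    private final int segmentCapacity; // records per segment
    private final int maxSegments;     // in-memory segment budget

    SegmentedHashIndex(int segmentCapacity, int maxSegments) {
        this.segmentCapacity = segmentCapacity;
        this.maxSegments = maxSegments;
        segments.addLast(new Segment());
    }

    // Inserts into the newest segment; rolls a new segment when it fills up,
    // and evicts the oldest one (the stand-in for flushing to disk) if over budget.
    void insert(String key, long recordId) {
        Segment current = segments.peekLast();
        if (current.size >= segmentCapacity) {
            current = new Segment();
            segments.addLast(current);
            if (segments.size() > maxSegments) {
                segments.pollFirst(); // Kite would flush this to the disk component
            }
        }
        current.entries.computeIfAbsent(key, k -> new ArrayList<>()).add(recordId);
        current.size++;
    }

    // Scans the in-memory segments, oldest to newest.
    List<Long> search(String key) {
        List<Long> result = new ArrayList<>();
        for (Segment s : segments) {
            List<Long> ids = s.entries.get(key);
            if (ids != null) result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        SegmentedHashIndex idx = new SegmentedHashIndex(2, 2);
        for (long id = 1; id <= 5; id++) idx.insert("obama", id);
        // Segments of capacity 2; the oldest segment (ids 1,2) was evicted.
        System.out.println(idx.search("obama")); // [3, 4, 5]
    }
}
```

In Kite, eviction writes the segment to HDFS rather than discarding it; the sketch only models the in-memory bookkeeping.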
Syntax:
  DROP INDEX index_name stream_name
Example:
  DROP INDEX index1 stream1
This statement drops an existing index.
Syntax:
  DROP STREAM stream_name
This statement drops an existing stream.
Syntax:
  SELECT attribute_list FROM stream_name [WHERE condition] [TOPK k] [TIME time_interval]
Examples:
  SELECT * FROM stream1
  SELECT id, keyword FROM stream1 TOPK 17
  SELECT id, keyword FROM stream1 WHERE keyword = obama
  SELECT id, keyword FROM stream1 WHERE keyword = obama TOPK 70 TIME [13 Jan 2017, 15 Jan 2017]
  SELECT id, keyword FROM stream1 WHERE (keyword = obama OR keyword = trump) AND location WITHIN [50,24,-122,-126] TOPK 50
This statement posts a query on an existing stream.
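Conceptually, WHERE filters the stream and TOPK caps the answer size. The self-contained sketch below shows that semantics over an in-memory list, assuming TOPK keeps the k most recent matches; that reading is our assumption, not a documented Kite rule, and Kite itself answers such queries through its index structures rather than by scanning.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SelectExample {
    // Minimal microblog record, for illustration only.
    static class Microblog {
        final long id;
        final long timestamp;
        final String keyword;

        Microblog(long id, long timestamp, String keyword) {
            this.id = id;
            this.timestamp = timestamp;
            this.keyword = keyword;
        }
    }

    // Evaluates SELECT ... WHERE keyword = kw TOPK k: filter, then keep the
    // k most recent matches (newest first).
    static List<Microblog> select(List<Microblog> stream, String kw, int k) {
        return stream.stream()
                .filter(m -> m.keyword.equals(kw))
                .sorted(Comparator.comparingLong((Microblog m) -> m.timestamp).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Microblog> stream = new ArrayList<>();
        stream.add(new Microblog(1, 100, "obama"));
        stream.add(new Microblog(2, 200, "trump"));
        stream.add(new Microblog(3, 300, "obama"));
        stream.add(new Microblog(4, 400, "obama"));
        for (Microblog m : select(stream, "obama", 2)) {
            System.out.println(m.id); // prints 4 then 3
        }
    }
}
```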
Syntax:
  SHOW stream_name
Example:
  SHOW stream1
This statement shows the user the continuous insertion operations in an existing stream and its index structures.
Syntax:
  UNSHOW stream_name
Example:
  UNSHOW stream1
This statement reverts the effect of a SHOW statement on an existing stream.
Syntax:
  PAUSE stream_name
Example:
  PAUSE stream1
This statement pauses data insertion in an existing stream and all its index structures.
Syntax:
  RESUME stream_name
Example:
  RESUME stream1
This statement reverts the effect of a PAUSE statement on an existing stream.
Syntax:
  ACTIVATE index_name stream_name
Example:
  ACTIVATE index1 stream1
This statement activates insertion on an existing index structure.
Syntax:
  DEACTIVATE index_name stream_name
Example:
  DEACTIVATE index1 stream1
This statement deactivates insertion on an existing index structure.
Syntax:
  RESTART stream_name
Example:
  RESTART stream1
This statement restarts an existing stream and all its index structures. It is usually used after a system machine restarts, to replay a Microblog stream that was running on the machine before it went down.
Syntax:
  DESC [stream_name]
Example:
  DESC stream1
This statement describes the system metadata. If a stream name is provided, the statement outputs a description of the given stream and all its index structures. If no stream name is provided, the statement describes all existing streams in the system, both active and paused.

Java APIs

All Kite features can be used from Java programs by adding the Kite jar file to the Java project and importing edu.umn.cs.kite.*. Internally, all MQL statements are executed by translating them into equivalent Java code. In this tutorial, we describe how to launch a Kite machine and give the equivalent Java code for each MQL statement.
Action: Launch a Kite machine
  KiteLaunchTool kite = new KiteLaunchTool();
  KiteInstance.initSettings(kite, settingsFilePath);
  // or, for the default settings file:
  KiteInstance.initSettings(kite);

Action: Execute an MQL statement
  String statement = "CREATE....";
  parsingResults = MQL.parseStatement(statement);
Notes: The parser returns a Boolean indicating a successful or failed parsing, a String error message in case of failed parsing, and a MetadataEntry in case of successful parsing.

Action: CREATE STREAM
  StreamFormatInfo format = new StreamFormatInfo("csv", attrIndecies);
  Scheme scheme = new Scheme(attrList);
  Preprocessor preprocessor = new MicroblogCSVPreprocessor(format, scheme);
  StreamingDataSource source = new SocketStream(host, port, preprocessor);
  StreamDataset stream = new StreamDataset(name, source);
  KiteInstance.addStream(stream.getName(), stream);

Action: CREATE INDEX
  StreamDataset stream = new StreamDataset(...);
  stream.createIndexHash(index_attribute, index_name, index_capacity,
      num_index_segments, loadDiskIndex);
  stream.createIndexSpatial(index_attribute, index_name, new GridPartitioner(...),
      index_capacity, num_index_segments, loadDiskIndex);
Notes: loadDiskIndex is true when the index previously exists in the system, and false otherwise.

Action: DROP INDEX
  StreamDataset stream = KiteInstance.getStream(stream_name);
  KiteInstance.removeIndexMetadata(stream_name, index_name);

Action: DROP STREAM
  StreamDataset stream = KiteInstance.getStream(stream_name);

Action: SELECT
  StreamDataset stream = KiteInstance.getStream(stream_name);
  MQLResults results = (new Query(...), attributeNames);

Action: Stream management statements (SHOW, UNSHOW, PAUSE, RESUME, ...)
  StreamDataset stream = KiteInstance.getStream(stream_name);
  ...

Action: DESC
  KiteInstance.descStream(stream_name);
  KiteInstance.descAllStreams();


Streaming Data Source Example

We provide an example streaming data source that works over network TCP connections. Kite users can download the data source binaries and source files from here. The jar file takes an input text file in this format; an example input file can be downloaded here. This data source reads data from a local file system folder. The data folder has one or more subfolders, and each subfolder has one or more data files. A sample data folder can be downloaded from here. This example data source reads files compressed in GZip format, where each line of a file represents one Tweet in JSON format, following the Twitter APIs format. To read other file formats, two methods should be edited: TextStream.openNextFile(), to read file formats other than GZip, and TweetJSONPreprocessor.preprocess(String jsonTweet), to parse Tweet formats other than JSON.
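Reading a GZip-compressed file line by line, with one tweet per line, can be sketched with the standard java.util.zip classes. This is illustrative only, not the data source's actual code; the demo builds a small .gz payload in memory instead of opening a real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLinesExample {
    // Reads every line from a GZip-compressed stream (one tweet per line).
    static List<String> readGzipLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Compresses text to GZip in memory, then reads it back line by line.
    static List<String> gzipRoundTrip(String text) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return readGzipLines(new ByteArrayInputStream(buf.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> tweets = gzipRoundTrip("{\"id\":1}\n{\"id\":2}\n");
        System.out.println(tweets.size()); // 2
    }
}
```

For real files, the ByteArrayInputStream would simply be replaced by a FileInputStream over each .gz data file.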

MQL Examples

Kite users can download example MQL statements here.

Kite Features

Kite's main features are listed here. Full details of the supported features in Kite are maintained here.