Matthew Note

ElasticSearch Note

ElasticSearch


Concept

index/indices

An index contains multiple types.

type

A type contains multiple documents.

documents

Each record is a document.
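To make the hierarchy concrete, here is a Sense-style request (the index, type, and field names are made up for illustration):

```
PUT /weblog/event/1
{
  "message": "user login",
  "status": 200
}
```

Here `weblog` is the index, `event` is a type inside it, and the JSON body is stored as document 1.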

Restriction

  • ES is not good at SQL-style JOIN operations; they consume a large number of compute cycles. Joins are only supported indirectly, through sub-queries (nested and parent/child relations)

Index Limitation

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.

Analyzed String

By default every incoming string field is analyzed, which means it may be tokenized. Set up the mapping with the Mapping API before inserting any data.
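A minimal mapping sketch for the pre-5.x string type (index and field names are illustrative):

```
PUT /weblog
{
  "mappings": {
    "event": {
      "properties": {
        "url": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

With "index": "not_analyzed" the field is stored as a single token, so exact-match queries behave like SQL equality instead of matching individual words.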

Both the Lucene query-string syntax and the JSON DSL syntax are supported

  • Lucene
    Enter a query string in the Search field:
    1. To perform a free text search, simply enter a text string. For example, if you’re searching web server logs, you could enter safari to search all fields for the term safari.
    2. To search for a value in a specific field, you prefix the value with the name of the field. For example, you could enter status:200 to limit the results to entries that contain the value 200 in the status field.
    3. To search for a range of values, you can use the bracketed range syntax, [START_VALUE TO END_VALUE]. For example, to find entries that have 4xx status codes, you could enter status:[400 TO 499].
    4. To specify more complex search criteria, you can use the Boolean operators AND, OR, and NOT. For example, to find entries that have 4xx status codes and have an extension of php or html, you could enter status:[400 TO 499] AND (extension:php OR extension:html).
  • Cross-index queries are supported
  • Automatic index creation can be disabled by setting action.auto_create_index to false in the config file of all nodes. Automatic mapping creation can be disabled by setting index.mapper.dynamic to false in the config files of all nodes (or on the specific index settings).
  • The index operation can be executed without specifying the id. In such a case, an id will be generated automatically. In addition, the op_type will automatically be set to create. Here is an example (note the POST used instead of PUT)
  • A child document can be indexed by specifying its parent when indexing. When indexing a child document, the routing value is automatically set to be the same as its parent, unless the routing value is explicitly specified using the routing parameter. The mapping must be set up in advance; dynamic mapping cannot configure the parent/child relation automatically
  • If you only need one or two fields from the complete _source, you can use the _source_include & _source_exclude parameters to include or filter out the parts you need. This can be especially helpful with large documents where partial retrieval can save on network overhead. Both parameters take a comma separated list of fields or wildcard expressions. Example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source_include=*.id&_source_exclude=entities'
# shorter notation
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=*.id,retweeted'
  • inner_hits is used for nested/child/parent searches; without it ES returns the top-level document instead of the inner hit you expect. ES does not seem to keep the nested structure natively: nested objects get flattened as well
  • A global aggregation lets an aggs run unaffected by the query results; similarly, reverse_nested can reach back up and fetch elements of the parent structure
  • significant Terms???
  • pipeline aggregations: still experimental, but awesome
  • Complex Core Field Types
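For the parent/child point above, here is a sketch of the mapping that must exist before indexing (names are illustrative; the _parent syntax follows the 2.x-era API):

```
PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": { "type": "branch" }
    }
  }
}

PUT /company/branch/london
{ "name": "London office" }

PUT /company/employee/1?parent=london
{ "name": "Alice" }
```

The parent=london parameter routes the child document to the same shard as its parent, which is why the relation has to be declared in the mapping up front.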

filter_path

The URI can take a filter_path parameter specifying which fields to return, for example:

curl -XGET 'localhost:9200/_nodes/stats?filter_path=nodes.*.ho*'
{
  "nodes" : {
    "lvJHed8uQQu4brS-SXKsNA" : {
      "host" : "portable"
    }
  }
}

Bucket Aggregations

A bucket aggregation only selects a subset of the documents; its result is then consumed by subsequent bucket or metric aggregations (see Bucket Aggregations).
Buckets are analogous to SQL GROUP BY statements
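A terms aggregation is the closest analogue of GROUP BY; a sketch (index and field names are made up):

```
GET /weblog/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" }
    }
  }
}
```

Each distinct status value becomes one bucket, roughly like SELECT status, COUNT(*) FROM weblog GROUP BY status.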

Metric Aggregations

Aggregates documents into measurable numbers (see Metric Aggregations)
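For example, an avg metric can be nested inside a bucket aggregation (field names are assumptions):

```
GET /weblog/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" },
      "aggs": {
        "avg_bytes": { "avg": { "field": "bytes" } }
      }
    }
  }
}
```

This returns the average of the bytes field per status bucket, similar to SELECT status, AVG(bytes) ... GROUP BY status.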

Scripted fields

Scripted fields use the Lucene expression syntax. For more information, see Lucene Expressions Scripts.
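A scripted field is a single Lucene expression over numeric doc values; for example, total traffic per record might look like this (the field names are assumptions):

```
doc['inbound'].value + doc['outbound'].value
```

Lucene expressions can only read numeric fields, so string fields cannot be used in a scripted field like this.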

Conclusion

  • A nested data structure raises insert cost by roughly 50%, because ES implements an update as DELETE->INSERT; that is where the extra overhead comes from. Queries also take somewhat more time, though the cost is negligible when retrieving by document ID
  • A flat, syslog-like structure makes inserts efficient, but the structure is not particularly clear
  • The official docs say parent/child queries are far less efficient than nested ones (see Practical Considerations). If indexing speed matters more than search speed, the parent/child model is worth considering
  • Appending data with Partial Updates to Documents is only possible through scripts, which are disabled by default because of injection risks, so use Scripting with care
  • include_in_parent can add nested data to the parent node in a flat format, but this feature seems to be deprecated. Enabling it slows things down, because it searches back through the parent's elements
  • When designing a time-oriented log analysis system, it is recommended to use index_name+time as the index name; that makes management and sharding much easier
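The time-based naming in the last point might look like this (the weblog prefix is made up):

```
PUT /weblog-2016.03.01/event/1
{ "message": "user login" }

GET /weblog-2016.03.*/_search
{ "query": { "match_all": {} } }
```

A whole day can then be dropped with a single DELETE /weblog-2016.03.01, which is far cheaper than deleting individual documents.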

Kibana


  • filter is global??
  • Include/Exclude Pattern
  • Include/Exclude Pattern Flags
  • JSON input
  • A Discover query cannot produce a final aggregate value; it only retrieves data
  • Sense is a great plugin; it saves you from using curl

Restriction

  • Visualization of nested and parent/child data is not supported yet, but people are already working on it, so it should arrive before long
  • Tables cannot be exported
  • Expanding a document in Discover is not very friendly either

Requirement & Design


Runtime

Strictly speaking, runtime data is not something a log analyzer should handle, but for now we assume it is supported

  • Active user: check whether the session template has an endtime
  • Active user’s Realm/Zone/Community/IP

Statistics

  • Total inbound/outbound traffic for the TopN resources
  • Total inbound/outbound traffic for the TopN users
  • Total number of Realms/Zones/Communities
  • TopN most frequently accessed resources
  • TopN resources by inbound/outbound traffic
  • Total number of AccessAgents
  • Total number of Platforms
  • Geographic distribution
  • TopN failure profiles

All of the above can be broken down further, e.g. the same results for active users, for web access, for tunnels, or for users who have already disconnected

Whether someone is a web user can be inferred from whether an IP is present, but that is not exact

Design

The design basically contains two data types (plus one singleton document):

  • SessionInfo: contains Username, Realm, Community, ClientIP, assigned IP, starttime, endtime, failure profile, Zone, Platform, AccessAgent. All of these fields can be filled in later through partial updates, so the resource cost is basically controllable. Every SessionInfo document uses its teamID as _id
  • AccessInfo: contains teamID, destination, port, protocol, inbound, outbound, plus all of the SessionInfo data (a parent/child relation could be tried here). In practice we only need to copy a few basic fields such as teamID, username, the two IPs and accessAgent; the remaining items have no direct relation to the accessed resource, and we are unlikely to analyze relations between them. To support this, KernelSession may need to emit a teamID (12 bytes)
  • CurrentUser: a single document, updated only through partial updates
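A sketch of the partial-update flow for SessionInfo, keyed by teamID (the index/type names and the teamID value are made up):

```
POST /sessions/sessioninfo/team-000000000001/_update
{
  "doc": { "endtime": "2016-03-01T12:00:00Z" },
  "doc_as_upsert": true
}
```

doc_as_upsert creates the document if it does not exist yet, so fields that arrive out of order can all use the same request shape.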

TODO


  • Define the timestamp
  • Retrieve the list of all distinct teamIds (nested query)
  • A scripted update can increment and decrement values, which may be useful for counting concurrent users (see Update Doc). All access records could be appended to the same document (would that cause performance problems?). Kibana supports nested aggregation
  • Upsert handles the update-if-exists, create-if-missing case
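A sketch of a scripted upsert for a concurrent-user counter (index/type names are assumptions; the inline-script syntax varies between ES versions, and scripting must be enabled):

```
POST /stats/currentuser/1/_update
{
  "script": "ctx._source.active += 1",
  "upsert": { "active": 1 }
}
```

If the document is missing, the upsert body creates it with active = 1; otherwise the script increments the counter.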