Mining Splunk's Internal Logs
Splunk is great about logging its warnings and errors, but it won’t tell you about them – you have to ask!
As the leading software for analyzing machine-generated data, Splunk unsurprisingly excels at creating robust logs of its own. The current version of Splunk Enterprise (v8.0.5) generates 22 different logs (for a complete, current list, see: What Splunk logs about itself). These logs don't consume license usage, so other than disk space there is no downside to all this logging, and the information the logs provide can be eye-opening. The challenge for the Splunk administrator is getting a handle on these logs and using them to troubleshoot issues, find unknown errors, and improve performance.
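If you are curious which of these internal logs are actually active in your own environment, a quick survey of the _internal index will show you (a minimal sketch; widen or narrow the time range to taste):
index=_internal earliest=-24h
| stats count by sourcetype, source
| sort - count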
The most critical log to master is splunkd.log, which records events for the splunkd daemon. The SPL to query the splunkd logs is a bit more complicated than it probably should be. The easy part is setting the index, since all of Splunk's internal logs are conveniently kept in the _internal index. Sourcetype is more complicated, because while there is a splunkd sourcetype, five other logs (splunkd_access.log, splunkd_stdout.log, etc.) share this sourcetype. Source is also problematic because the location (and therefore the source) of splunkd.log varies depending on the product and OS. For example, on a Windows Universal Forwarder, splunkd.log is stored at:
C:\Program Files\SplunkUniversalForwarder\var\log\splunk\splunkd.log
But, on a Linux computer running Splunk Enterprise, the log is at:
/opt/splunk/var/log/splunk/splunkd.log
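You can see the sourcetype overlap for yourself by counting events per source under the splunkd sourcetype (a quick sketch; the sources returned will vary by product and OS):
index=_internal sourcetype=splunkd
| stats count by source
| sort - count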
Luckily, Splunk comes with a pre-defined event type for splunkd.log (eventtype="splunkd-log"), which is defined as:
index=_internal source=*/splunkd.log OR source=*\\splunkd.log
That simplifies the SPL for all your splunkd log events from all servers and forwarders to: eventtype="splunkd-log".
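Before building reports on top of the eventtype, it is worth a quick sanity check that every server and forwarder is actually phoning home to the _internal index (a simple sketch; the 24-hour window is arbitrary):
eventtype="splunkd-log" earliest=-24h
| stats count latest(_time) as LastEvent by host
| convert ctime(LastEvent)
| sort - count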
To limit the search to just the important events, you need to specify the desired log levels in your SPL. The splunkd.log has five log levels: DEBUG, INFO, WARN, ERROR, and FATAL. DEBUG is turned off by default, and INFO describes expected events. So, to get the important events, you can use the query:
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL)
The WARN/ERROR/FATAL search will also include dispatch manager warnings that fire when users exceed their quota of concurrent searches. I don't really consider this a splunkd warning so much as a user warning, so I filter these out of the query. You may want to create a separate search with component=DispatchManager to monitor user quotas (a sketch of one follows the final query below). The final search is:
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) component!=DispatchManager
As a matter of course, the final search above (with the addition of a host clause) should be run whenever a change is made. You may not think this is necessary, but in large Splunk environments even basic maintenance tasks like rebooting a server or upgrading an app can have unforeseen consequences. (I once had four million errors in a single day from an indexer with bucket replication errors after normal monthly OS patching.) Running this search will let you know if something went wrong so you can correct it before it gets out of hand.
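A sketch of such a post-change check follows; the host name and four-hour window are placeholders, and a general version of this query also appears in the reference list at the end of this article:
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) component!=DispatchManager host=myhost earliest=-4h
| stats count by component, event_message
| sort - count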
Splunk errors and warnings chart
Now that you have an SPL query for the splunkd.log file, you can easily create a daily report of errors by host, so that with just a quick glance you can tell the health of your environment. Errors often happen in relation to load, so the query shows the tally for each of the last eight days. This way, you can compare yesterday's results to the week before. I set up my report as a stacked column chart printed in landscape and include the results as well as the chart in the report. The query for this report is:
index=_internal source="/opt/splunk/var/log/splunk/splunkd.log" (log_level=WARN OR log_level=ERROR) component!=DispatchManager earliest=-8d@d latest=-0d@d
| timechart count by host span=1d limit=10
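The query above pins the source to the default Linux path. If your deployment also includes Windows forwarders, a variant built on the eventtype will pick those hosts up as well (a sketch, otherwise identical to the query above):
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR) component!=DispatchManager earliest=-8d@d latest=-0d@d
| timechart count by host span=1d limit=10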
Splunk errors and warnings table
So far, you have created two queries for reviewing splunkd errors and warnings: a low-level report of events and a high-level chart by day and host. A third report is a rollup of similar errors. That way, you can see which errors are occurring most often in your environment.
For your report, besides the log level, error message, and count, you might also want to see the first time the error occurred and the last time it occurred. That way, you will know whether the error is ongoing or time-specific. I'm going to leave the dispatch errors (user quota errors) in this report, since that lets me see if someone is constantly coming up against their search quota. Finally, you want the report to include the list of computers reporting each error, so you know whether an error is on only one server or is deployment-wide. The SPL for this report is:
eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_message
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count
This report is good, but we can do better. What you really want is a tally of errors by type, but Splunk does not assign error IDs, so you need some other way to group the log events. You can try the punct field for grouping, but that doesn't work very well because many splunkd.log events contain GUIDs whose punctuation varies. Since the numbers in events often differ even when the words are the same, let's replace every token that contains a digit with a pound sign. Also, let's collapse each run of whitespace into a single space. Our new query is:
eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_summary
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count
Hopefully these queries will help you manage your Splunk environment better. I suggest that you get at least one report delivered to your mailbox every morning; that way you'll always have an idea of the overall health of your environment. You can of course tweak the queries to fit your needs, for example by limiting results to only the top errors, or to only errors from your production servers.
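A trimmed-down version of the rollup report might look like the sketch below. The host=prod* pattern is just a hypothetical naming convention, so swap in whatever identifies your production hosts, and adjust the head value to control how many error groups reach your inbox:
eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL) host=prod*
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_summary
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count
| head 20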
Splunk errors and warnings dashboards in department apps
This is the secret sauce for easy Splunk DevOps. About half the time, ingest errors in Splunk are due to changes made on the forwarders, not the indexers. Errors can happen when forwarder admins change permissions, delete the splunk user account, upgrade a monitored application, repurpose a server, and so on. When they do, the forwarder admin knows the root cause of the error but the Splunk admin does not. Rather than having the Splunk admin initiate an error investigation only to find out the forwarder admin changed a setting, flip the paradigm so the forwarder admin lets the Splunk admin know if something is amiss. This is easily accomplished by adding a dashboard showing applicable errors and warnings to every custom/department app. With visibility into their Splunk errors, departments can take greater ownership of their Splunk app and escalate the issues they need help resolving. This model doesn't work for every organization, but having users take greater ownership of their data in Splunk will ultimately increase the usage of Splunk in your organization, which is a good thing for everyone.
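A department dashboard panel can be as simple as the base search scoped to that department's forwarders. The host=dept-* pattern below is a placeholder for whatever host naming convention, lookup, or other scoping identifies the department's servers:
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) host=dept-*
| timechart count by log_level span=1h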
SPL queries for splunkd.log events
Check for issues on a host
eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) component!=DispatchManager host=myhost
Create an eight-day stacked column chart of all issues
index=_internal source="/opt/splunk/var/log/splunk/splunkd.log" (log_level=WARN OR log_level=ERROR) component!=DispatchManager earliest=-8d@d latest=-0d@d
| timechart count by host span=1d limit=10
Create a grouped report of all issues on all hosts
eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_summary
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count