There are two native ways to filter and process incoming events before they’re indexed by Splunk: SEDCMD and TRANSFORMS. Both run either as data passes through a heavy forwarder (HF) or when it arrives, unparsed, at an indexer. I emphasize unparsed because if the data has already passed through an HF, the receiving indexer will not attempt to parse it again.
SEDCMD is a props.conf-only directive that applies to _raw (the event text). It’s typically used for quick find-and-replace. TRANSFORMS are slightly more complex but are designed for flexibility and reusability. They are defined in transforms.conf, called from props.conf, and can be applied to _raw or to other keys and fields available at the time of invocation.
Let’s now take a look at how we can filter, modify, and anonymize data using either method. We’ll use this sample event, sourcetyped as sensitive-data, as our example:
2018-07-04 11:11:56 Event=UpdateBilling, orderType=PlanChange, credit_card=4111111111111111, credit_score=730, esn=975526D39743E8, accountNumber=900019092, orderNumber=32968936, userName=Trungx, email=Trungxfoo@example.com
1. Anonymize credit_card and credit_score values:
SEDCMD
[sensitive-data] <- props.conf
SEDCMD-cc = s/(credit_card|credit_score)=\d+,/\1=#####,/g
TRANSFORMS
[sensitive-data] <- props.conf
TRANSFORMS-anon = anonymize-cc, anonymize-cs
[anonymize-cc] <- transforms.conf
REGEX = (.*?credit_card=)\d+(.*)
FORMAT = $1#####$2
DEST_KEY = _raw
REPEAT_MATCH = true
[anonymize-cs] <- transforms.conf
REGEX = (.*?credit_score=)\d+(.*)
FORMAT = $1#####$2
DEST_KEY = _raw
REPEAT_MATCH = true
New Output:
2018-07-04 11:11:56 Event=UpdateBilling, orderType=PlanChange, credit_card=#####, credit_score=#####, esn=975526D39743E8, accountNumber=900019092, orderNumber=32968936, userName=Trungx, email=Trungxfoo@example.com
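Both variants can be sanity-checked offline before touching a forwarder or indexer. The sketch below replays the SEDCMD substitution and the two TRANSFORMS stanzas with Python's re module. It is only an approximation: Splunk uses PCRE, Python's engine can differ in corner cases, and the helper names (apply_sedcmd, apply_transform) are made up for illustration and are not Splunk APIs.

```python
import re

# Sample event from above (ASCII hyphens in the timestamp).
EVENT = (
    "2018-07-04 11:11:56 Event=UpdateBilling, orderType=PlanChange, "
    "credit_card=4111111111111111, credit_score=730, esn=975526D39743E8, "
    "accountNumber=900019092, orderNumber=32968936, userName=Trungx, "
    "email=Trungxfoo@example.com"
)


def apply_sedcmd(raw: str) -> str:
    """Mimic the SEDCMD-cc substitution (with its g flag) on _raw."""
    return re.sub(r"(credit_card|credit_score)=\d+,", r"\1=#####,", raw)


def apply_transform(raw: str, regex: str) -> str:
    """Mimic a transform with FORMAT = $1#####$2 and DEST_KEY = _raw.

    Events that do not match the regex are returned unchanged.
    """
    match = re.search(regex, raw)
    if match is None:
        return raw
    return f"{match.group(1)}#####{match.group(2)}"


sed_result = apply_sedcmd(EVENT)

# TRANSFORMS-anon = anonymize-cc, anonymize-cs (applied in that order)
transforms_result = EVENT
for regex in (r"(.*?credit_card=)\d+(.*)", r"(.*?credit_score=)\d+(.*)"):
    transforms_result = apply_transform(transforms_result, regex)

print(sed_result)
print(sed_result == transforms_result)
```

On the sample event, both approaches produce the same masked line, which is a useful check that the two configurations are interchangeable for this data.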
2. Filter out all events with pattern esn=*
[sensitive-data] <- props.conf
TRANSFORMS-drop = drop-with-esn
[drop-with-esn] <- transforms.conf
REGEX = esn=\d+
DEST_KEY = queue
FORMAT = nullQueue
All events that match esn=\d+ will be sent to the nullQueue, which is a fancy way of saying dropped.
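To see which events this rule would drop, the regex can be replayed in Python. This is a rough sketch rather than Splunk behavior verbatim: route_event is a hypothetical helper, and indexQueue is used here simply as a label for events that are kept. It also surfaces a subtlety: esn=\d+ only requires the ESN to start with a digit, so an ESN beginning with a letter would slip through. If the intent really is esn=*, a broader pattern such as esn=\S+ may be closer (a suggestion, not part of the configuration above).

```python
import re

DROP_REGEX = re.compile(r"esn=\d+")  # REGEX from the drop-with-esn stanza


def route_event(raw: str) -> str:
    """Return 'nullQueue' for events the transform would drop, else 'indexQueue'."""
    return "nullQueue" if DROP_REGEX.search(raw) else "indexQueue"


events = [
    # ESN starts with a digit: matched and dropped.
    "Event=UpdateBilling, esn=975526D39743E8, accountNumber=900019092",
    # No esn key at all: kept.
    "Event=UpdateBilling, accountNumber=900019092",
    # ESN starts with a letter: NOT matched by esn=\d+, so kept.
    "Event=UpdateBilling, esn=D39743E8975526, accountNumber=900019092",
]

for raw in events:
    print(route_event(raw), "<-", raw)
```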
3. Create a new field called accountRegion from the first four digits of the accountNumber:
[sensitive-data] <- props.conf
TRANSFORMS-addregion = accountRegion
[accountRegion] <- transforms.conf
REGEX = (.*?accountNumber=)(\d{4})(\d+.*)
FORMAT = $1$2$3, accountRegion=$2
DEST_KEY = _raw
REPEAT_MATCH = true
New Output:
2018-07-04 11:11:56 Event=UpdateBilling, orderType=PlanChange, credit_card=4111111111111111, credit_score=730, esn=975526D39743E8, accountNumber=900019092, orderNumber=32968936, userName=Trungx, email=Trungxfoo@example.com, accountRegion=9000
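This rewrite can be traced offline in Python. The sketch assumes Python's re module is a close enough stand-in for Splunk's PCRE engine, and rebuilds _raw from the three capture groups the way FORMAT = $1$2$3, accountRegion=$2 describes. add_account_region is a hypothetical helper name, and the event is shortened for readability.

```python
import re

# REGEX from the accountRegion stanza: prefix, first four digits, the rest.
REGION_REGEX = re.compile(r"(.*?accountNumber=)(\d{4})(\d+.*)")


def add_account_region(raw: str) -> str:
    """Append accountRegion=<first four digits> to the event text."""
    match = REGION_REGEX.search(raw)
    if match is None:
        # Fewer than five digits (or no accountNumber): leave _raw untouched.
        return raw
    prefix, region, rest = match.groups()
    return f"{prefix}{region}{rest}, accountRegion={region}"


event = (
    "2018-07-04 11:11:56 Event=UpdateBilling, accountNumber=900019092, "
    "orderNumber=32968936, userName=Trungx"
)
rewritten = add_account_region(event)
print(rewritten)
```

Because DEST_KEY is _raw, the new key=value pair becomes part of the event text itself. Note that \d{4}\d+ needs at least five digits, so shorter account numbers pass through unchanged.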
Note that an endpoint reload is required unless your props/transforms configurations are on an indexer cluster, in which case reloading happens automatically.
Tips & Tricks
- Splunk processes data in a linear pipeline, so there is naturally an order of operations. SEDCMD is applied before TRANSFORMS.
- Try to make your regular expressions as specific as possible to minimize processing overhead.
- Avoid applying SEDCMD and TRANSFORMS to all data. Limit them to specific sourcetypes, sources, or hosts.
- Numerous regular expressions are hard to manage and troubleshoot. Use them only when you need them.
Hope this helps. Simple replacements with SEDCMD and TRANSFORMS can get you far, but if you’d like an easier and more powerful way of solving this problem, we would like to hear from you. We are building a product focused on this problem, and a community designed to share solutions for common data types.
Please contact us at sluice@diag.ai for more information.