[Files] Find

This module lets you locate a set of files that you want to process in a later step or module. It provides a variety of options for gathering a list of files from a specified location, and these options can be combined for very rich filtering of the files it identifies. We start by outlining the parameters the module uses in the following table.

Parameter I/O Description
FileSource In The base folder (path) where the find module begins looking for eligible files. Note: while most other built-in modules support a filename pattern at the end of the folder path for this same-named parameter, this module does not. The pattern feature is considerably more capable in this module and is specified as a separate parameter.
FilenamePattern In A simple filename pattern of the kind used in many legacy applications, with the asterisk * and question mark ? characters as wildcards. These simple patterns can be combined with the vertical bar separator | as many times as needed. Not specifying a pattern at all is equivalent to specifying the all-files *.* pattern. More examples appear in the section below, but some basic examples are:
*.jpg
*.jpg|*.jpeg
*.doc|*.docx|*.xls|*.xlsx
IncludeRegEx In This allows specifying a regular expression to find files from the source path. The matching is done by file name only, not on the entire path. Using a regular expression allows for complex selection of filename rules. Since regular expressions can be involved, we will provide more details and examples in the section below.
ExcludeRegEx In This allows specifying a regular expression to filter out any files from the source path that we do not want. The matching is done by file name only, not on the entire path. Since this uses a regular expression, it can be combined with both the filename pattern and include regex options to provide a rich set of rules for processing specifically named files in a source path.
IncludeSubfolders In True: The find operation will dig down into any sub-folders located in the source path and will include any files matched in the sub-folders. The sub-folders option is recursive and will keep looking for files any number of subfolders deep.
False (default): The find operation is only going to look for files located directly within the source path.
AgeType In Newer: Match only files newer than the thresholds specified below.
Older: Match only files older than the thresholds specified below.
None (default): The threshold options below have no effect on the matching files found.
ThresholdValue In An integer value for the age threshold. This is combined with the next option to define the threshold. For example, a value of 3 here means 3 days or 3 weeks depending on which ThresholdDuration is chosen.
ThresholdDuration In Minutes, Hours, Days, Weeks, Months
Combining this with ThresholdValue above allows for selecting a wide range of aging options for file selection. Defaults to Minutes.
NoFilesFoundOutcome In ContinueJob (default): If no files are found, continue to the next step of the job.
StopJobWithSuccess: If no files are found, stop the job with a success status.
StopJobWithFailure: If no files are found, stop the job with a failure status to indicate that this might not be the intended result.
LogOutputLevel In Minimal: Normal output to the log.
Verbose: More detailed output is written to the log, suitable for debugging purposes.
FileList Out This parameter lists all the files from the Find operation that the module was able to identify.

In most situations where you have a single folder from which you pick up files for some process, specifying the FileSource folder location may be all you need. At a minimum, though, you should consider specifying a filename pattern that limits the search to the specific type of file(s) you expect to see. For example, if some other process always drops .log files into a folder and you never expect to see any other type of file there, you should use a *.log pattern rather than specifying *.* explicitly or supplying no pattern at all.

Almost all Windows applications support this same style of filename pattern in some way, but they often limit you to a single pattern. The filename pattern in this module lets you specify nearly any number of patterns by separating each with the vertical bar (|), also referred to as the pipe character. So if you have a process that reads a variety of image files in several different formats, you might use a filename pattern like the following example. Notice how this also handles JPEG image files regardless of whether the person or application that created them used the .jpg or the .jpeg filename extension.

*.jpg|*.jpeg|*.gif|*.png|*.bmp
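To make the pipe-separated behavior concrete, here is a minimal Python sketch of how such a pattern list could be evaluated. It is not the module's actual implementation: the function name `matches_any`, the case-insensitive comparison, and the use of Python's `fnmatch` rules are all assumptions made for illustration. In particular, `fnmatch` treats `[` as the start of a character class and requires a literal dot for `*.*`, which may differ from how the module handles those cases.

```python
from fnmatch import fnmatchcase


def matches_any(filename: str, pattern_list: str) -> bool:
    """Return True if filename matches at least one pattern in a |-separated list."""
    # Assumption: an empty pattern list means "match every file".
    if not pattern_list.strip():
        return True
    patterns = [p.strip() for p in pattern_list.split("|") if p.strip()]
    # Assumption: matching ignores case, as is typical for Windows filenames.
    name = filename.lower()
    return any(fnmatchcase(name, p.lower()) for p in patterns)


names = ["photo.JPG", "scan.jpeg", "notes.txt", "logo.png"]
images = [n for n in names if matches_any(n, "*.jpg|*.jpeg|*.gif|*.png|*.bmp")]
print(images)  # ['photo.JPG', 'scan.jpeg', 'logo.png']
```

Splitting on the pipe and accepting a file as soon as any single pattern matches is what lets one parameter cover both the .jpg and .jpeg spellings.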

The IncludeRegEx option also works as a filter and can be combined with whatever you have specified in the filename pattern. This helps in the image example above, because the regular expression only has to describe the name portion you care about and does not have to end with a clause that limits matches to all the supported image filename extensions. It is matched against just the filename (it is not compared against the full source path or any sub-folders the file might be found in). Continuing with the example above, suppose we want to make sure our image files have what looks like a valid date in the format YYYYMMDD somewhere in the filename. We can do that with an expression like the following example.

^.*\d\d\d\d(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01]).*\..+$
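As a quick way to see what this expression accepts, the sketch below runs it against a few made-up filenames using Python's `re` module. The sample names are hypothetical, and Python's regex engine stands in for whatever engine the module uses, so treat the results as indicative only. Note that the expression checks that the month is 01 to 12 and the day is 01 to 31, but it does not verify that the day is valid for that month.

```python
import re

# The IncludeRegEx from the text, unchanged.
INCLUDE = re.compile(r"^.*\d\d\d\d(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01]).*\..+$")

# Hypothetical filenames, used only for illustration.
samples = [
    "scan_20240115.jpg",   # valid-looking date
    "scan_20241301.jpg",   # month 13 is rejected
    "vacation.png",        # no date at all
    "20231231_party.gif",  # date at the start of the name
    "scan_20240231.jpg",   # Feb 31 still passes: only ranges are checked
]

results = {name: bool(INCLUDE.match(name)) for name in samples}
for name, matched in results.items():
    print(f"{name}: {matched}")
```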

The ExcludeRegEx option filters this set of files further: when it is used, any file whose filename matches the regular expression is excluded. Because exclusions can be handled here, your IncludeRegEx can stay much shorter and easier to read without resorting to overly complex expressions. Continuing with the example above, suppose we want to make sure we do not pick up any files that have the text “_OLD” at the end of the filename, immediately before the extension.

^.*_OLD\..+$

By combining these parameters, we wind up with a comprehensive set of rules that define which filenames we want to recognize in the source folder, while keeping the regular expressions we use to accomplish that simpler and easier to read. For a more visual introduction to combining filename patterns with the include/exclude regular expressions, see the article at the following URL for additional examples.

https://kb.jobserver.net/Q100032
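As a rough illustration of how the three name filters described above narrow a list of files, the following Python sketch applies the filename pattern, the include expression, and the exclude expression in turn. The function name `is_selected`, the order of evaluation, the case-insensitive pattern match, and the sample filenames are all assumptions for the sake of the example, not a description of the module's internals.

```python
import re
from fnmatch import fnmatchcase

# The three example filters from the text.
PATTERNS = "*.jpg|*.jpeg|*.gif|*.png|*.bmp"
INCLUDE = re.compile(r"^.*\d\d\d\d(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01]).*\..+$")
EXCLUDE = re.compile(r"^.*_OLD\..+$")


def is_selected(filename: str) -> bool:
    """Keep a file only if it passes the pattern, include, and exclude filters."""
    name = filename.lower()
    # Assumption: the wildcard pattern is matched without regard to case.
    if not any(fnmatchcase(name, p) for p in PATTERNS.split("|")):
        return False
    # The filename must contain something that looks like YYYYMMDD.
    if not INCLUDE.match(filename):
        return False
    # Anything ending in _OLD before the extension is dropped.
    return not EXCLUDE.match(filename)


# Hypothetical filenames, used only for illustration.
names = [
    "scan_20240115.jpg",      # passes all three filters
    "scan_20240115_OLD.jpg",  # removed by the exclude expression
    "scan_20240115.txt",      # not an image extension
    "holiday.png",            # no date in the name
]
selected = [n for n in names if is_selected(n)]
print(selected)  # ['scan_20240115.jpg']
```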

The option to include subfolders is used when you want all the files in the specified path plus all the files located in any subfolders beneath that top-level path. This option is recursive and will retrieve all the files located anywhere in the folder structure under the top-level path. When the FileList output of the [Files] Find module is connected to another module that can work with relative paths, the relative path information included in the FileList output allows such modules to offer enhanced functionality. For example, if the subfolders are populated with files and you use [Files] Find in conjunction with the [Files] Copy/Move module, one of the options in that module is to replicate subfolders on the destination. It is this metadata encoded in the FileList output that allows the feature to preserve the original relative hierarchy of the files found in the source, when desired. Modules that do not recognize the hierarchy data that [Files] Find provides simply treat the list the same as a list from a single source folder, ignoring the hierarchy information. This is covered in more detail under the Copy/Move modules.
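The difference between a top-level search and a recursive one, and the idea of carrying each file's path relative to the source folder, can be sketched in Python as follows. The helper `find_files` and the temporary folder layout are hypothetical; the actual encoding of hierarchy data in the FileList output is not documented here, so relative path strings are used purely as a stand-in.

```python
import tempfile
from pathlib import Path


def find_files(base: Path, include_subfolders: bool = False) -> list[str]:
    """List files under base as forward-slash paths relative to base."""
    # rglob walks every subfolder level; glob stays in the top-level folder.
    candidates = base.rglob("*") if include_subfolders else base.glob("*")
    return sorted(p.relative_to(base).as_posix() for p in candidates if p.is_file())


# Build a small throwaway folder tree to search.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "2024" / "jan").mkdir(parents=True)
    (base / "top.log").write_text("a")
    (base / "2024" / "jan" / "deep.log").write_text("b")

    top_only = find_files(base)
    everything = find_files(base, include_subfolders=True)

print(top_only)    # ['top.log']
print(everything)  # ['2024/jan/deep.log', 'top.log']
```

Keeping the path relative to the source folder is what would let a downstream step recreate the same subfolder layout at a destination.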

The next three parameters work together as a single set of options. If they are left at their defaults, the output of the find operation includes every file that matches the filtering options described so far, regardless of the file's timestamp. This set of parameters provides a way to select only files that are older or newer than some threshold measured back from the current execution time. The first of these parameters is AgeType, which defaults to None, meaning that filtering by age is turned off and the other two parameters, ThresholdValue and ThresholdDuration, have no effect on the results. Changing AgeType to either Newer or Older enables the option, and the other two parameters then control the timeframe we want to focus on.

When the AgeType setting is changed to Newer or Older, we can specify a timeframe based on when files were last modified and narrow our focus to just those files. The timeframe can be as short as a number of minutes or as long as a number of months. There is no Years option, so to specify three years you need to do a tiny bit of math and enter 36 Months. You could select only files older than 90 days by setting the three parameters to AgeType: Older; ThresholdValue: 90; ThresholdDuration: Days. You might connect a [Files] Purge module to this to clean up a folder of old files used by some other process. Going the other direction, you could select only files newer than 60 minutes by setting the three parameters to AgeType: Newer; ThresholdValue: 60; ThresholdDuration: Minutes. You might connect this to a process that runs once an hour to import data files and needs to evaluate only the most recently updated file(s).
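The way the three age parameters combine into a single cutoff can be sketched in Python as shown below. This is only an illustration: the function name `passes_age_filter`, the treatment of a month as 30 days, and the handling of a file that sits exactly on the cutoff are assumptions, and the module may compute these differently.

```python
from datetime import datetime, timedelta

# Length of one unit for each ThresholdDuration value.
# Assumption: a month is approximated as 30 days.
UNIT = {
    "Minutes": timedelta(minutes=1),
    "Hours": timedelta(hours=1),
    "Days": timedelta(days=1),
    "Weeks": timedelta(weeks=1),
    "Months": timedelta(days=30),
}


def passes_age_filter(modified: datetime, now: datetime, age_type: str = "None",
                      value: int = 0, duration: str = "Minutes") -> bool:
    """Decide whether a file's modified time satisfies the age settings."""
    if age_type == "None":
        return True  # age filtering is switched off
    cutoff = now - value * UNIT[duration]
    if age_type == "Older":
        return modified < cutoff
    if age_type == "Newer":
        return modified > cutoff
    raise ValueError(f"unknown AgeType: {age_type}")


now = datetime(2024, 6, 1, 12, 0)
old_file = now - timedelta(days=120)
recent_file = now - timedelta(minutes=10)

# AgeType: Older; ThresholdValue: 90; ThresholdDuration: Days
print(passes_age_filter(old_file, now, "Older", 90, "Days"))        # True
print(passes_age_filter(recent_file, now, "Older", 90, "Days"))     # False
# AgeType: Newer; ThresholdValue: 60; ThresholdDuration: Minutes
print(passes_age_filter(recent_file, now, "Newer", 60, "Minutes"))  # True
```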

The NoFilesFoundOutcome parameter provides a way to override some of the default behavior of the module. Its default value is ContinueJob, and we will see how the other options differ from this in a moment. Normally, once all your filtering options are combined and the module finds the matching files, it emits them through the FileList output parameter. It does this whether it finds a few files, thousands of files, or no files at all. This means that when no files match your specifications, that empty result is still passed on to the next step or module in your job. This is fine for most modules, as a well-written module should handle an empty input list properly. But there can be times when a specific module does not handle an empty list as input, or when it simply makes sense for the job to stop at this point if [Files] Find did not find any files matching your search request. It is in this last case that the option becomes particularly useful.

By changing this parameter to either StopJobWithSuccess or StopJobWithFailure, the module stops the job from proceeding any further when the find operation returns no matching files. The only difference between these options is whether the step is flagged as completing successfully or not. If finding no files simply means there is no further work to do in the job, then stopping with a success status makes sense, as this is no cause for concern and it is safe to stop any further processing here. Otherwise, if a particular job always expects to find one or more files when it runs, and finding none on a given run might indicate some sort of problem, then the failure option makes it stand out in your log and gives you the ability to trigger DevOps notifications in the job definition if desired.
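The decision described above can be summarized in a few lines of Python. The enum values mirror the option names from the parameter table, but the function `after_find` and its return shape are hypothetical and exist only to make the three outcomes easy to compare.

```python
from enum import Enum


class NoFilesFoundOutcome(Enum):
    CONTINUE_JOB = "ContinueJob"
    STOP_JOB_WITH_SUCCESS = "StopJobWithSuccess"
    STOP_JOB_WITH_FAILURE = "StopJobWithFailure"


def after_find(file_list: list[str], outcome: NoFilesFoundOutcome) -> tuple[bool, bool]:
    """Return (job_continues, step_succeeded) for a finished find operation."""
    # With at least one file, or with the default setting, the job carries on.
    if file_list or outcome is NoFilesFoundOutcome.CONTINUE_JOB:
        return True, True
    # No files were found and the job is configured to stop here.
    if outcome is NoFilesFoundOutcome.STOP_JOB_WITH_SUCCESS:
        return False, True
    return False, False


print(after_find([], NoFilesFoundOutcome.CONTINUE_JOB))           # (True, True)
print(after_find([], NoFilesFoundOutcome.STOP_JOB_WITH_SUCCESS))  # (False, True)
print(after_find([], NoFilesFoundOutcome.STOP_JOB_WITH_FAILURE))  # (False, False)
```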

The LogOutputLevel parameter controls the amount of detail written to the log when the job is executed. For [Files] Find, there is normally no reason to set a higher level of logging detail. The exception is when you are setting up a non-trivial combination of options and want the log to record more information about the settings in use and the results they generate; in that case, the Verbose option can be useful.