@RaminHAL9001
Last active November 18, 2017 15:54
<html><head>
<style type="text/css">
div.top-level {
width: 20cm;
margin-left: 0.5cm;
margin-right: 0.5cm;
}
p, ol, td, blockquote {
font-family: serif;
color: #000000;
line-height: 1.5em;
}
h1 {
font-family: serif;
color: #000000;
}
h2 {
font-family: serif;
color: #000000;
}
h3 {
font-family: serif;
color: #000000;
}
code {
background-color: #F0F0F0;
color: #400000;
}
pre {
background-color: #F0F0F0;
color: #000000;
line-height: 1.4em;
}
.vocabulary-word {
font-weight: bold;
font-style: oblique;
color: #008000;
vertical-align: top;
}
span.prompt {
color: #808080;
}
span.file-name, a.file-name {
font-family: monospace;
text-decoration: underline;
color: #204020
}
.user-input {
color: #400000;
}
a.section-link {
color: #000080;
text-decoration: none;
}
a:hover.section-link {
color: #0000FF;
text-decoration: underline;
}
code.single-char {
color: #400000;
border: 0.0625em solid;
border-radius: 0.25em;
background-color: #F0F0FF;
font-size: 1.4em;
}
.output {
color: #000040;
}
span.keystroke {
color: #000000;
border: 0.0625em solid;
border-radius: 0.25em;
font-family: sans-serif;
font-style: oblique;
}
code.token {
color: #0000F8;
border: 0.0625em solid;
border-radius: 0.25em;
background-color: #F0F0FF;
}
td.source-code {
background-color: #FFFFFF;
border: 1px solid black;
padding: 2px;
}
pre.source-code {
background-color: #FFFFFF;
}
table.source-code {
background-color: #F0F0F0;
border: 1px solid black;
margin: 20px;
}
</style>
<title>Bash Basics: How The Command Shell "Sees" the Words it Reads</title>
</head><body>
<div class="top-level">
<h1>Bash Basics: How The Command Shell "Sees" the Words it Reads</h1>
<p>As someone who can read, you may take for granted that every word is
separated by a space. But to an unintelligent computer program like Bash, the
process of breaking input into individual words must be defined in computer
code as a grammatical algorithm.</p>
<p>It can be very helpful to understand these rules. When you
enter a command and Bash does not do what you expected, could it be because it
is simply reading or <q>understanding</q> your command in a way that you don't
expect? Often this is the problem, and a thorough understanding of how
Bash actually reads and understands commands can make your life much easier as
you become more skilled in using Ubuntu, Linux, or macOS.</p>
<p>Fortunately, you don't need to be an expert to understand Bash's tokenizing
grammar. Most of the ordinary Bash token grammar is simple enough for anyone
to understand. There are a few complicated rules that experts need to worry
about, but we will worry about those another day. Let's keep things simple for
now:</p>
<em><b>When you type anything into Bash, the first thing it does is break down what
you typed into a list of words called <q>tokens</q>.
</b></em>
<p>Once Bash has its list of words (tokens), it then <q>thinks</q> about each
word one by one. In this lesson, we learn the six most basic rules Bash uses to
read the command you typed, and how it breaks your command down into tokens
that it can understand and think about individually. There are actually a few
more than six rules, but this chapter goes over the most basic rules.</p>
<ol>
<li> <a class="section-link" href="#hash_tags">Ignore hash tags.</a>
<li> <a class="section-link" href="#space_separated">Tokens are separated by spaces (usually).</a>
<li> <a class="section-link" href="#quotes">Tokens can have spaces in them if they are quoted.</a>
<li> <a class="section-link" href="#join_adjacent">Tokens that are not separated by spaces are joined together.</a>
<li> <a class="section-link" href="#special_punctuation">Some punctuation marks are special and are not joined together.</a>
<li> <a class="section-link" href="#backslash">The backslash turns a special punctuation mark into an ordinary token.</a>
</ol>
<table class="glossary"><thead class="glossary"><tr><th colspan="2"><h3>Terminology</h3></th></tr></thead>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">Token</span>:</td><td>A single, atomic unit of computer code that is constructed
from a sequence of <q>characters</q> in accordance with the grammatical rules
of the computer language.</td></tr>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">Character</span>:</td><td>A single letter, number, punctuation mark, or whitespace
value which contains the smallest amount of human-readable information.</td></tr>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">String</span>:</td><td>A unit of data containing a sequence of characters. A string
is different from a token in that tokens are elements taken out of a computer
program according to token grammar rules, whereas a string can contain any data
without regard for token grammar. In the Bash programming language there is
almost no practical difference between tokens and strings, but many other
programming languages do not allow you to treat tokens as strings.</td>
</tr>
</table>
<h2>Before we begin...</h2>
<p>Bash is everywhere, so it is incredibly easy to open a Terminal window and
just start experimenting. It takes no effort, and you can do it any time you
want. So as you read, there is no need to take this instruction as mere computer
science theory; you can actually put theory into practice!</p>
<p>So before we begin learning the rules, here is a two-line Bash program you
can try right now to experiment with the various examples below. Enter this
code into a plain-text editor (for example TextEdit set to plain-text mode) and
save the file as <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> in your
Home folder: </p>
<table class="source-code">
<thead>
<tr><td>
&#x1F5CE; <a name="tokenizer.sh"><span class="file-name">~/tokenizer.sh</span></a>
</td></tr></thead>
<tr><td class="source-code">
<pre class="source-code">
#!/bin/bash
( for x in "${@}" ; do echo "$x"; done; ) | cat -n
</pre>
</td></tr>
</table>
<p>Now open Terminal and check if you did it right:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">cat tokenizer.sh</span>
<span class="output">#!/bin/bash</span>
<span class="output">( for x in "${@}" ; do echo "$x"; done; ) | cat -n</span>
</pre>
<p>If instead you get an error:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">cat tokenizer.sh</span>
<span class="output">cat: tokenizer.sh: No such file or directory</span>
</pre>
<p>then please check to make sure you saved the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> in the Home
folder, or else use the <code>cd</code> command to change to the directory in
which you did save the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> file.
</p>
<p>Let's try running <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> in the Terminal. Type
<code class="user-input">bash tokenizer.sh This is an example.</code> as the
command and press the Enter key; Bash will generate the rest. The whole
interaction will look like this in your Terminal window:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">bash tokenizer.sh This is an example.</span>
<span class="output">     1  This</span>
<span class="output">     2  is</span>
<span class="output">     3  an</span>
<span class="output">     4  example.</span>
<span class="prompt">YourName@ComputerName:~$ </span>
</pre>
<p>Did it work? Great! So from now on, if you see an example in the text below,
which looks like this:</p>
<div><code>This is an example.</code></div>
<ol>
<li><code class="token">This</code></li>
<li><code class="token">is</code></li>
<li><code class="token">an</code></li>
<li><code class="token">example.</code></li>
</ol>
<p>don't be afraid to try the example out using <q><span class="file-name">tokenizer.sh</span></q>.</p>
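<p>If you would rather not create a file, a shell function can play the same
role. The following is only a sketch: the name <code>show_tokens</code> is made
up for this example and is not part of Bash. It prints each token it receives
on its own numbered line, much like <q><span class="file-name">tokenizer.sh</span></q> does.</p>

```shell
#!/bin/bash
# show_tokens is a hypothetical helper name, not a Bash built-in.
# It prints every argument (token) it receives on a separate numbered line.
show_tokens() {
  local number=0 token
  for token in "$@"; do
    number=$((number + 1))
    printf '%d. %s\n' "$number" "$token"
  done
}

show_tokens This is an example.
```

<p>After pasting the function into Terminal once, you can type
<code>show_tokens</code> wherever the examples would use
<code>bash tokenizer.sh</code>.</p>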
<h3>If you get stuck...</h3>
<p>Never forget that <span class="keystroke">Ctrl C</span> will <b>C</b>ancel
the command you were typing and let you start again from nothing.</p>
<p>Occasionally you may mis-type something and Bash will freeze. For example,
if you type an apostrophe <code class="single-char">'</code> all alone: </p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">bash tokenizer.sh Why won't this work?</span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span><span class="user-input"> echo try again</span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span><span class="user-input"> askdjaczxca</span>
<span class="prompt">&gt; </span><span class="user-input"> jasiuhq fidjfn ZXzxucaccas asdasdf </span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span>
<span class="prompt">&gt; </span><span class="user-input"> aaaaaaaaaaaaaaaaaaaaaaaaaaaa </span>
<span class="prompt">&gt; </span>
</pre>
<p>I kept pressing <span class="keystroke">Enter</span> but the command prompt
never came back. All I got was the <code class="single-char">&gt;</code>
symbol, and it wouldn't do anything! What is happening here is that the
apostrophe is actually an opening single-quote character (discussed in <a
href="#quotes" class="section-link">rule #3</a>) and Bash waits for you to write the closing
single-quote character; it waits even after you press the Enter key.
Double-quotes <code class="single-char">&quot;</code> will cause the same
problem, as can parentheses <code class="single-char">(</code> or brackets
<code class="single-char">{</code>, as we will see with <a class="section-link"
href="#special_punctuation">rule #5</a>.</p>
<p>If you ever make this mistake, <span
class="keystroke">Ctrl C</span> is your friend.
</p>
<h3>So let's get started learning about the Bash tokenizer rules!</h3>
<hr />
<h2><a name="hash_tags">Rule 1: Ignore hash tags</a></h2>
<p>This is the simplest rule: when a token begins with the <code class="single-char">#</code> character,
Bash ignores that character and everything after it up to the end of the line.
You can use this to write comments to yourself:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">#This line starts with a hash tag. It does absolutely nothing.</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">#But you can use a hash tag in the middle of a command as well:</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo Some people #just don't</span>
<span class="output">Some people</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo know how #useful Bash can be</span>
<span class="output">know how</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo to use a computer #with a command line interface.</span>
<span class="output">to use a computer</span>
<span class="prompt">YourName@ComputerName:~$ </span>
</pre>
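<p>Comments work the same way inside a script file. The short script below is a
sketch, not the only way to do this: the helper name <code>join_tokens</code>
and the variable names are invented for this example. The helper glues the
tokens it receives together with a <code>|</code> so that you can see where
each token begins and ends. It also shows one detail worth knowing: a hash only
starts a comment when it stands at the beginning of a token.</p>

```shell
#!/bin/bash
# join_tokens is a hypothetical helper: it glues its tokens together with "|"
# and stores the result in the variable "joined".
join_tokens() {
  local IFS='|'
  joined="$*"
}

# Everything from the hash to the end of the line is a comment.
join_tokens Some people #just don't
with_comment=$joined

# A hash in the middle of a token is an ordinary character.
join_tokens issue#42 is one token
hash_inside=$joined

echo "$with_comment"   # prints: Some|people
echo "$hash_inside"    # prints: issue#42|is|one|token
```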
<hr />
<h2><a name="space_separated">Rule 2: Tokens are separated by spaces (usually).</a></h2>
<p>When you type <q><code>This is some text.</code></q> into Bash, what is the
list of tokens it sees? Well, the first thing Bash will do is just tear the sentence
up along the spaces between the tokens:</p>
<div class="user-input"><code>This is some text.</code></div>
<ol>
<li><code class="token">This</code></li>
<li><code class="token">is</code></li>
<li><code class="token">some</code></li>
<li><code class="token">text.</code></li>
</ol>
<p>But do you see how <q><code class="token">text.</code></q> has a dot after it? This is
because there is no space between the token "text" and the dot. That means the
dot is part of the token. So what would happen if you wrote a space between the
token <q><code>text</code></q> and the dot:
<q><code>This is some text .</code></q>?
Well then, Bash would see this list of tokens:</p>
<div class="user-input"><code>This is some text .</code></div>
<ol>
<li><code class="token">This</code></li>
<li><code class="token">is</code></li>
<li><code class="token">some</code></li>
<li><code class="token">text</code></li>
<li><code class="token">.</code></li>
</ol>
<p>With a space between the token <q><code>text</code></q> and the dot, the dot
becomes its own token.</p>
<blockquote class="remember">
<b>Good to remember:</b> while Bash does not think the lone dot <code
class="single-char">.</code> is special, the dot <em>could</em> be special to
commands Bash runs, like the <code class="user-input">ls</code> command. For
some commands, the dot means <q><b>right here</b>,</q> as in, <q>save a
file <b>right here</b>.</q> Other times, a dot just means a dot, like when it is
part of a file's name, e.g. <code class="output">photo.jpg</code>. But in the
Bash language, dots (and also commas) have no special grammatical meaning. They
are just part of ordinary tokens, and get mixed together with other tokens
according to the usual tokenizer rules.
</blockquote>
<p>Tokens are made of letters, numbers and the non-special punctuation marks discussed below.</p>
<div><code>This sentence has 7 tokens in it #and this is ignored.</code></div>
<ol>
<li><code class="token">This</code></li>
<li><code class="token">sentence</code></li>
<li><code class="token">has</code></li>
<li><code class="token">7</code></li>
<li><code class="token">tokens</code></li>
<li><code class="token">in</code></li>
<li><code class="token">it</code></li>
</ol>
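<p>One consequence of splitting on spaces is that it makes no difference how
many spaces you type between two tokens. The sketch below counts tokens by
collecting them into a list (a Bash array); the variable names are made up for
this example.</p>

```shell
#!/bin/bash
# The parentheses collect the tokens into a list (a Bash array) so that
# they can be counted and printed. The variable names are made up.
one_space=(This is some text.)
many_spaces=(This    is     some      text.)
dot_apart=(This is some text .)

echo "${#one_space[@]}"     # prints: 4
echo "${#many_spaces[@]}"   # prints: 4 (extra spaces do not add tokens)
echo "${#dot_apart[@]}"     # prints: 5 (the lone dot is its own token)

# Print each token of the last list in square brackets, one per line.
printf '[%s]\n' "${dot_apart[@]}"
```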
<hr />
<h2><a name="quotes">Rule 3: Quoted tokens can have spaces in them</a></h2>
<p>It is often useful to tell Bash to treat a whole bunch of words as just one
token. This comes in handy when telling Bash to use a file, where the file name
has spaces in it. To do this, we use a single-quote character, also known as
the apostrophe, for example:</p>
<div class="user-input"><code>His exact words were, 'Yes, I think so.'</code></div>
<ol>
<li><code class="token">His</code></li>
<li><code class="token">exact</code></li>
<li><code class="token">words</code></li>
<li><code class="token">were,</code></li>
<li><code class="token">Yes, I think so.</code></li>
</ol>
<p>The fifth token above is everything between the single-quotes. Notice that
the single-quotes do not exist in the token itself. Dots and commas have no
special grammatical meaning to Bash, but single-quotes do. A single-quote says
to Bash, <b>take all the characters you see until the next single-quote and treat
them as one big token,</b> and remove the single-quotes.</p>
<p>Which character is more powerful, the hash <code
class="single-char">#</code> or the single quote <code
class="single-char">'</code>? The answer is: <u>whichever one comes first</u>
is the one Bash uses. Consider the command
<code>echo ### Welcome! ###</code>. The first hash <code
class="single-char">#</code> character turns everything after it into a
comment, so <code>echo</code> prints only an empty line. Let's try
this again with the single-quote character:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo '### Welcome! ###'</span>
<span class="output">### Welcome! ###</span>
<span class="prompt">YourName@ComputerName:~$ </span>
</pre>
<p>Putting the <code class="token">### Welcome! ###</code> inside of the
single-quotes made the hash characters into part of the token. But if we flipped
it around and wrote the hash character before the single-quote, the hash would
win:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo ### 'Welcome!' ###</span>
<span class="output"></span>
<span class="prompt">YourName@ComputerName:~$ </span>
</pre>
<p>It is also possible to use a double-quote <code class="single-char">&quot;</code>
character to construct tokens with spaces, <b>but be careful!</b> Double-quote
characters have an entirely different set of rules they follow when
constructing tokens. For simple tokens, they behave like single-quote <code
class="token">'</code> characters. Let's retry the <q>Yes, I think so.</q>
example above but with double-quotes <code class="token">&quot;</code> instead
of single quotes:
</p>
<div class="user-input"><code>His exact words were, "Yes, I think so."</code></div>
<ol>
<li><code class="token">His</code></li>
<li><code class="token">exact</code></li>
<li><code class="token">words</code></li>
<li><code class="token">were,</code></li>
<li><code class="token">Yes, I think so.</code></li>
</ol>
<p><b>However</b>, things start to go wrong if you aren't careful when you make
tokens using double-quotes <code class="single-char">&quot;</code>:</p>
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "It costs less than $5 in the USA."</span>
<span class="output">It costs less than  in the USA.</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "Hello, world!"</span>
<span class="output">Hello, world!</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "Hello, world!!"</span>
<span class="output">echo "Hello, worldecho "Hello, world!""</span>
<span class="output">Hello, worldecho Hello, world!</span>
</pre>
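<p>What went wrong? Inside double-quotes, Bash still treats the dollar sign
<code class="single-char">$</code> and the exclamation point
<code class="single-char">!</code> as special: <code>$5</code> was replaced
with the value of a variable named <code>5</code> (which is normally empty), and
<code>!!</code> was replaced with the previous command.
Double-quoted <code class="single-char">&quot;</code> tokens are used to expand
variables into character strings, a feature known as
<q><a href="https://en.wikipedia.org/wiki/String_interpolation">String
Interpolation</a></q>. We will talk more about string interpolation in the
lesson about Bash variables, but here is a quick preview of what double-quote
tokens can do when used correctly:
</p>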
<pre>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">ITEM='Apple Cinnamon Cappuccino'</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">COST=3.95</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "You can buy a delicious ${ITEM} for only \$${COST}"\!</span>
<span class="output">You can buy a delicious Apple Cinnamon Cappuccino for only $3.95!</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input"># Let's try the exact same thing with single quotes...</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo 'You can buy a delicious ${ITEM} for only \$${COST}'\!</span>
<span class="output">You can buy a delicious ${ITEM} for only \$${COST}!</span>
</pre>
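<p>So how could the <q>$5</q> sentence from earlier be written so that the
dollar sign survives? Here is a minimal sketch of two ways that follow from the
rules in this lesson: use single quotes, or put a backslash in front of the
dollar sign. The variable names are invented for this example.</p>

```shell
#!/bin/bash
# Inside double quotes, $5 is read as a variable named 5, which is normally
# empty, so it vanishes from the text.
interpolated="It costs less than $5 in the USA."
# Single quotes keep every character exactly as typed.
single_quoted='It costs less than $5 in the USA.'
# A backslash in front of the dollar sign also keeps it literal.
escaped="It costs less than \$5 in the USA."

echo "$interpolated"
echo "$single_quoted"
echo "$escaped"
```

<p>The last two lines of output are identical and both contain
<code>$5</code>.</p>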
<hr />
<h2><a name="join_adjacent">Rule 4: Tokens not separated by spaces are joined together into a single token.</a></h2>
<p>How is this different from <a class="section-link" href="#space_separated">rule #2</a>? If you have two tokens, like a number
and a word, right next to each other, for example <q><code>123hello</code></q>,
it is probably obvious to you that Bash will treat those as a single
token. But what about input like this:</p>
<div><code>'Working hard?''Hardly working!'</code></div>
<p>Will this be two tokens or just one token? The answer is: <u>one token</u>,
because there is no space between the first and second single-quoted tokens.
The two individual tokens:</p>
<p>
<code class="token">Working hard?</code> <code
class="token">Hardly working!</code>
</p>
<p>are joined into a single token.</p>
<ol>
<li><code class="token">Working hard?Hardly working!</code>
</ol>
<p>Each token is a single-quoted token, but there is no space between the two
single-quoted tokens, so Bash joins these two tokens into one big token. This
is a very useful feature which will come up again in the lesson about
variables.
</p>
<p>But what happens if we type something like this:</p>
<div><code>He won't do it because he doesn't even know how.</code></div>
<ol>
<li><code class="token">He</code>
<li><code class="token">wont do it because he doesnt</code>
<li><code class="token">even</code>
<li><code class="token">know</code>
<li><code class="token">how.</code>
</ol>
<p>Only 5 tokens. Why? Because Bash thinks the apostrophes in the tokens
<q><i>won't</i></q> and <q><i>doesn't</i></q> are actually single-quotes, and
all of the letters and spaces between those single-quotes, <q><code>'t do it
because he doesn'</code></q> becomes one long token <q><code class="token">t do
it because he doesn</code></q>. So the input <q><code>won't do it because he
doesn't</code></q> is actually a three-part token:</p>
<div><code class="token">won</code> <code class="token">t do it because he doesn</code>
<code class="token">t</code></div>
<p>And since there is no space between these
tokens, they are all joined into one big token, as you can see above.</p>
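<p>The joining rule can be checked in a small script. This is a sketch with
made-up names (<code>count_args</code>, <code>one_token</code>,
<code>file_name</code>); it stores joined tokens in variables and counts how
many tokens the apostrophe sentence really contains.</p>

```shell
#!/bin/bash
# count_args is a hypothetical helper. Inside a function, $# is the number
# of tokens it was given (this hash does not start a comment because it is
# not at the beginning of a token).
count_args() {
  arg_count=$#
}

# Two quoted pieces with no space between them form one token.
one_token='Working hard?''Hardly working!'
echo "$one_token"

# Quoted and unquoted pieces join in the same way.
file_name=photo-'from my vacation'.jpg
echo "$file_name"

# The apostrophes act as quotes, so Bash sees only 5 tokens here.
count_args He won't do it because he doesn't even know how.
echo "$arg_count"
```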
<hr />
<h2><a name="special_punctuation">Rule 5: Most punctuation marks have special grammatical meaning</a></h2>
<p>We have seen how the hash <code class="single-char">#</code> and
single-quote <code class="single-char">'</code> characters have special
grammatical meaning to Bash. It is important to note that most punctuation
marks have special grammatical meaning.</p>
<p><b>That means, never use the following characters without quoting them</b>
unless you know what special thing they do. (Listed for reference, don't worry
about what this means for now)</p>
<table>
<tbody>
<tr><td valign="top"><code class="token">#</code></td><td valign="top">Hash</td><td valign="top">&mdash; Comment</td></tr>
<tr><td valign="top"><code class="token">'</code></td><td valign="top">Single Quote</td><td valign="top">&mdash; String delimiter</td></tr>
<tr><td valign="top"><code class="token">"</code></td><td valign="top">Double Quote</td><td valign="top">&mdash; Interpolating string delimiter</td></tr>
<tr><td valign="top"><code class="token">`</code></td><td valign="top">Back Quote</td><td valign="top">&mdash; Sub-process expansion</td></tr>
<tr><td valign="top"><code class="token">\</code></td><td valign="top">Backslash</td><td valign="top">&mdash; Escape special character</td></tr>
<tr><td valign="top"><code class="token">$</code></td><td valign="top">Dollar Sign</td><td valign="top">&mdash; Variable dereferencing</td></tr>
<tr><td valign="top"><code class="token">*</code></td><td valign="top">Asterisk</td><td valign="top">&mdash; Glob (a.k.a. Wildcard) pattern</td></tr>
<tr><td valign="top"><code class="token">&amp;</code></td><td valign="top">Ampersand</td><td valign="top">&mdash; Launch background command</td></tr>
<tr><td valign="top"><code class="token">&semi;</code></td><td valign="top">Semicolon</td><td valign="top">&mdash; Command delimiter</td></tr>
<tr><td valign="top"><code class="token">&lt;</code></td><td valign="top">Less Than</td><td valign="top">&mdash; Pull stream input from file</td></tr>
<tr><td valign="top"><code class="token">&gt;</code></td><td valign="top">Greater Than</td><td valign="top">&mdash; Push stream output to file</td></tr>
<tr><td valign="top"><code class="token">=</code></td><td valign="top">Equal Sign</td><td valign="top">&mdash; Assign variable</td></tr>
<tr><td valign="top"><code class="token">|</code></td><td valign="top">Pipe</td><td valign="top">&mdash; Command pipeline constructor</td></tr>
<tr><td valign="top"><code class="token">(</code></td><td valign="top">Open Round Bracket</td><td valign="top">&mdash; Sub-process command delimiter</td></tr>
<tr><td valign="top"><code class="token">)</code></td><td valign="top">Close Round Bracket</td><td valign="top">&mdash; Sub-process command delimiter</td></tr>
<tr><td valign="top"><code class="token">{</code></td><td valign="top">Open Curly Brackets</td><td valign="top">&mdash; Token choice pattern, or subroutine delimiter</td></tr>
<tr><td valign="top"><code class="token">}</code></td><td valign="top">Close Curly Brackets</td><td valign="top">&mdash; Token choice pattern, or subroutine delimiter</td></tr>
</tbody>
</table>
<p>Some characters are <b>sometimes</b> special and sometimes not. Avoid using
the following characters (again, unless you know what special thing they
do):</p>
<table>
<tbody>
<tr><td valign="top"><code class="single-char">%</code></td><td valign="top">Percent</td><td valign="top">&mdash; Background process selector (only special when used alone or with a number)</td></tr>
<tr><td valign="top"><code class="single-char">~</code></td><td valign="top">Tilde</td><td valign="top">&mdash; Abbreviation for home directory (only special at the start of a non-quoted token)</td></tr>
<tr><td valign="top"><code class="single-char">!</code></td><td valign="top">Exclamation Point</td><td valign="top">&mdash; Command history selection (special when followed by another exclamation point, a number, or a word; not special when followed by a space)</td></tr>
<tr><td valign="top"><code class="single-char">[</code></td><td valign="top">Open Square Bracket</td><td valign="top">&mdash; Character set pattern (only special when files matching the pattern exist)</td></tr>
<tr><td valign="top"><code class="single-char">]</code></td><td valign="top">Close Square Bracket</td><td valign="top">&mdash; Character set pattern (only special when files matching the pattern exist)</td></tr>
</tbody>
</table>
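<p>To see why these characters deserve care, here is a small sketch of what the
asterisk and the square brackets do. It assumes Bash's default settings and
works inside a temporary folder created with <code>mktemp</code>; the file and
variable names are invented for this example.</p>

```shell
#!/bin/bash
# Work in a fresh temporary folder so the result does not depend on
# whatever files happen to be in your current folder.
scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt"

# Unquoted, the asterisk is replaced by the matching file names.
star_bare=$(cd "$scratch" && echo *)
# Quoted, it is an ordinary one-character token.
star_quoted=$(cd "$scratch" && echo '*')
# Square brackets are only special when some file matches the pattern.
set_match=$(cd "$scratch" && echo [ab].txt)
set_no_match=$(cd "$scratch" && echo [xy].txt)

echo "$star_bare"      # prints: a.txt b.txt
echo "$star_quoted"    # prints: *
echo "$set_match"      # prints: a.txt b.txt
echo "$set_no_match"   # prints: [xy].txt

# Remove the temporary folder again.
rm -r "$scratch"
```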
<p>All other punctuation marks are used as parts of tokens, or become their own
token if they are separated by spaces. <b>So it is OK to use the following characters:</b></p>
<table>
<tbody>
<tr><td valign="top"><code class="single-char">_</code></td><td>Underscore</td></tr>
<tr><td valign="top"><code class="single-char">+</code></td><td>Plus Sign</td></tr>
<tr><td valign="top"><code class="single-char">-</code></td><td>Minus Sign</td></tr>
<tr><td valign="top"><code class="single-char">@</code></td><td>At Sign</td></tr>
<tr><td valign="top"><code class="single-char">/</code></td><td>Slash</td></tr>
<tr><td valign="top"><code class="single-char">:</code></td><td>Colon</td></tr>
<tr><td valign="top"><code class="single-char">.</code></td><td>Dot</td></tr>
<tr><td valign="top"><code class="single-char">,</code></td><td>Comma</td></tr>
<tr><td valign="top"><code class="single-char">^</code></td><td>Caret</td></tr>
</tbody>
</table>
<p>So let's see the kind of tokens we can make with the non-special characters:</p>
<div><code class="user-input">one/two/three four-five-six seven.eight.nine@ 10:11 p.m. me+my_date</code></div>
<ol>
<li><code class="token">one/two/three</code></li>
<li><code class="token">four-five-six</code></li>
<li><code class="token">seven.eight.nine@</code></li>
<li><code class="token">10:11</code></li>
<li><code class="token">p.m.</code></li>
<li><code class="token">me+my_date</code></li>
</ol>
<p>The non-special characters are just a part of the token in which they
appear, as if they were no different from a letter or number. But the spaces
between the words still separate tokens according to <a class="section-link" href="#space_separated">rule #2</a>.</p>
<h3>Rule #5.1: special tokens do not join with other tokens</h3>
<p>So <a class="section-link" href="#join_adjacent">rule #4</a> does not apply to special tokens.
Let's take a quick look at how special tokens are tokenized. </p>
<div><code>if(true);then(echo yes);fi|cat -n;</code></div>
<p>In this example, there are several special characters used:
<code class="single-char">&semi;</code>,
<code class="single-char">|</code>,
<code class="single-char">(</code>, and
<code class="single-char">)</code> (the hyphen
<code class="single-char">-</code> is not special). So how do you think this will tokenize?
</p>
<blockquote class="notice">
<b>BE AWARE</b> that this example will <b>NOT</b> work with the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> program. If
you do try it, it will report an error:
<pre>
<span class="output">bash: syntax error near unexpected token `('</span>
</pre>
</blockquote>
<p>The answer is that it will tokenize like this (but again don't try this with
<q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q>):</p>
<ol>
<li><code class="token">if</code></li>
<li><code class="token">(</code></li>
<li><code class="token">true</code></li>
<li><code class="token">)</code></li>
<li><code class="token">&semi;</code></li>
<li><code class="token">then</code></li>
<li><code class="token">(</code></li>
<li><code class="token">echo</code></li>
<li><code class="token">yes</code></li>
<li><code class="token">)</code></li>
<li><code class="token">&semi;</code></li>
<li><code class="token">fi</code></li>
<li><code class="token">|</code></li>
<li><code class="token">cat</code></li>
<li><code class="token">-n</code></li>
<li><code class="token">&semi;</code></li>
</ol>
<p>Curly brackets <code class="single-char">{</code> and
<code class="single-char">}</code> are special in a different way: they do not
split tokens on their own, so <code>then{echo</code> would be read as one
ordinary token. When a curly bracket is used as a subroutine delimiter, it must
be written as a token of its own.</p>
<p>However these tokens are swept up by Bash and <b>immediately</b> crunched
into something else (in this case, a "conditional statement"), and this happens
even before the tokens are handed off to other programs like our
<q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> program. There is a more advanced Bash grammar that
occurs after the tokenization step which allows you to control if and when
certain commands are run, which we will learn more about in another lesson.</p>
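<p>A gentler way to watch special characters split tokens without any spaces is
to use just the semicolon and the pipe. The following is a sketch with made-up
variable names; each line captures the output of a short command so that it can
be printed afterwards.</p>

```shell
#!/bin/bash
# No spaces around the semicolon: Bash still sees two separate commands.
two_commands=$(echo one;echo two)
# No spaces around the pipe: the output of echo still flows into tr,
# which turns lowercase letters into uppercase ones.
piped=$(echo hello|tr a-z A-Z)
# An ordinary character such as the hyphen does not split anything.
hyphen=$(echo one-two)

echo "$two_commands"
echo "$piped"
echo "$hyphen"
```

<p>The first <code>echo</code> prints <code>one</code> and <code>two</code> on
separate lines, the second prints <code>HELLO</code>, and the third prints
<code>one-two</code> unchanged.</p>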
<hr />
<h2><a name="backslash">Rule 6: Backslash makes a special punctuation mark ordinary</a></h2>
<p>The last rule to remember is that all of the above mentioned special
characters become ordinary tokens, or parts of tokens, if they follow a
backslash <code class="single-char">\</code>. For example, if you want a token
to contain an apostrophe without Bash thinking it is a single-quote, you could
write this:</p>
<div><code>I won\'t make that mistake again.</code></div>
<ol>
<li><code class="token">I</code></li>
<li><code class="token">won't</code></li>
<li><code class="token">make</code></li>
<li><code class="token">that</code></li>
<li><code class="token">mistake</code></li>
<li><code class="token">again.</code></li>
</ol>
<div><code>He won\'t do it because he doesn\'t even know how.</code></div>
<ol>
<li><code class="token">He</code></li>
<li><code class="token">won't</code></li>
<li><code class="token">do</code></li>
<li><code class="token">it</code></li>
<li><code class="token">because</code></li>
<li><code class="token">he</code></li>
<li><code class="token">doesn't</code></li>
<li><code class="token">even</code></li>
<li><code class="token">know</code></li>
<li><code class="token">how.</code></li>
</ol>
<p>Any character at all, even spaces and hash tags, can be made to be part of a
token with the backslash:</p>
<div><code>This\ is\ one\ long\ token. \# These are separate tokens.</code></div>
<ol>
<li><code class="token">This is one long token.</code></li>
<li><code class="token">#</code></li>
<li><code class="token">These</code></li>
<li><code class="token">are</code></li>
<li><code class="token">separate</code></li>
<li><code class="token">tokens.</code></li>
</ol>
<p>Notice above that there is no backslash right after the <code
class="token">token.</code> token, so the space there is not "escaped" by a
backslash, and the token breaks there. Everything before that space is joined
together into one large token.
</p>
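<p>This is most useful for file names. As a rough sketch (the file name and the
variable names are made up), the three spellings below all produce exactly the
same single token:</p>

```shell
#!/bin/bash
# Three spellings of the same one-token file name (the name is made up).
with_backslashes=My\ Vacation\ Photo.jpg
with_single_quotes='My Vacation Photo.jpg'
with_double_quotes="My Vacation Photo.jpg"

echo "$with_backslashes"

# An escaped hash is an ordinary character instead of the start of a comment.
hash_token=\#
echo "$hash_token"
```

<p>Which spelling to use is a matter of taste; quotes are usually easier to
read when a name has several spaces.</p>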
<p>But a backslash is only good for one character:</p>
<div><code>With single-quotes '# #' but with a backslash \# # all the rest is ignored.</code></div>
<ol>
<li><code class="token">With</code></li>
<li><code class="token">single-quotes</code></li>
<li><code class="token"># #</code></li>
<li><code class="token">but</code></li>
<li><code class="token">with</code></li>
<li><code class="token">a</code></li>
<li><code class="token">backslash</code></li>
<li><code class="token">#</code></li>
</ol>
<p>The backslash only worked its magic on the first hash <code
class="single-char">#</code> character. The second hash stands at the beginning
of a new token, so it starts a comment and the rest of the line is ignored.
(A hash in the middle of a token, as in <code>\###</code>, is an ordinary
character and does not start a comment.)</p>
<p>How would we write an apostrophe in the middle of a single-quoted token? Like this:</p>
<div><code>She said, 'Well isn'\''t that something!'</code></div>
<ol>
<li><code class="token">She</code></li>
<li><code class="token">said,</code></li>
<li><code class="token">Well isn't that something!</code></li>
</ol>
<p>Why? Because the string
<code class="user-input">'Well isn'\''t that something!'</code>
contains three tokens</p>
<div>
<code class="token">Well isn</code>&nbsp;<code class="token">'</code>&nbsp;<code class="token">t that something!</code>
</div>
<p>which are not separated by white spaces, so the three tokens are joined into
one according to <a class="section-link" href="#join_adjacent">rule #4</a>.
</p>
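<p>There is more than one way to get an apostrophe into a token. The sketch
below compares three spellings; the variable names are made up, and the third
form, <code>$'...'</code>, is a Bash feature that this lesson does not
otherwise cover. The exclamation point is left out here to keep history
selection out of the picture.</p>

```shell
#!/bin/bash
# Close the quote, add an escaped apostrophe, and open the quote again.
closed_and_reopened='Well isn'\''t that something'
# Inside double quotes an apostrophe is an ordinary character.
double_quoted="Well isn't that something"
# Bash also has a $'...' form in which \' stands for an apostrophe.
ansi_c=$'Well isn\'t that something'

echo "$closed_and_reopened"
echo "$double_quoted"
echo "$ansi_c"
```

<p>All three lines of output read <code>Well isn't that something</code>.</p>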
<p>And backslashes treat <em>themselves</em> as ordinary characters as well. That is to
say, if one backslash is followed by a second backslash, the second backslash
is treated as an ordinary character. For sequences of backslashes, every two backslash
<code class="single-char">\</code><code class="single-char">\</code>
characters become a single backslash character
<code class="single-char">\</code>.</p>
<div><code>1 \\ 2 \\\\ 3 \\\\\\ 4 \\\\\\\\ 5 \\\\\\\\\\</code></div>
<ol>
<li><code class="token">1</code></li>
<li><code class="token">\</code></li>
<li><code class="token">2</code></li>
<li><code class="token">\\</code></li>
<li><code class="token">3</code></li>
<li><code class="token">\\\</code></li>
<li><code class="token">4</code></li>
<li><code class="token">\\\\</code></li>
<li><code class="token">5</code></li>
<li><code class="token">\\\\\</code></li>
</ol>
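<p>The pairing of backslashes is easy to confirm in a script. In this sketch
the variable names are made up, and <code>printf</code> is used to print the
values so that the backslashes are shown exactly as stored.</p>

```shell
#!/bin/bash
# Every pair of backslashes becomes one literal backslash.
one=\\
two=\\\\
three=\\\\\\

# Print each value on its own line: 1, 2 and 3 backslashes respectively.
printf '%s\n' "$one" "$two" "$three"
```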
<hr />
<h2>Conclusion</h2>
<p>So those are the most fundamental tokenizer rules for Bash. There will be more
rules, but these are the most important to remember. Usually, if we just never
use file names with spaces or punctuation marks in them, we never have to worry
about single-quoting or backslashes, and our life becomes easier. We can just
write tokens as they are and Bash will work as we expect it to.</p>
<p>This is why Linux and UNIX programmers like to name files like this:
<q><code>a-file-name-should-never-have-spaces.txt</code></q>. They use
Bash, and working with file names in Bash can get a bit tedious if the names
have spaces or special characters in them.</p>
<p>So here are all the basic tokenizer rules in Bash in a handy table which you
may want to keep in your notebook.</p>
<table>
<tr><td>1. Ignore hash tags</td><td><code>token token token # ignored ignored ignored</code></td></tr>
<tr><td>2. Tokens are separated by spaces.</td><td><code>this sentence has 7 tokens in it</code></td></tr>
<tr><td>3. Tokens can have spaces.</td><td><code>'this is just one token'</code> <code>this\ is\ also\ just\ one\ token</code></td></tr>
<tr><td>4. Tokens not separated by spaces are joined together into a single token.</td><td><code>firsttoken</code> <code>second' token'</code> <code>'third''token'</code> <code>fourth\ 'token'</code></td></tr>
<tr><td>5. Most punctuation marks have special meaning.</td><td>Special characters are: <code class="token">#</code> <code class="token">'</code> <code class="token">"</code> <code class="token">`</code> <code class="token">\</code> <code class="token">%</code> <code class="token">$</code> <code class="token">*</code> <code class="token">&amp;</code> <code class="token">&semi;</code> <code class="token">!</code> <code class="token">~</code> <code class="token">&lt;</code> <code class="token">&gt;</code> <code class="token">=</code> <code class="token">|</code> <code class="token">(</code> <code class="token">)</code> <code class="token">{</code> <code class="token">}</code></td></tr>
<tr><td>6. The backslash makes special characters ordinary.</td><td>You can enter <q><code>\#</code></q> or <q><code>\'</code></q> to use those characters alone.</td></tr>
</table>
<hr />
<H5>Copyright &copy; Ramin Honary 2017. This document is published under the <a
href="https://creativecommons.org/license/by-nc/3.0/legalcode">Creative Commons
Attribution-NonCommercial 3.0 Unported</a> license. The markup source code is
available on GitHub at <a
href="https://gist.github.com/RaminHAL9001/22bcb9c32786f089fb973444adfa0619">https://gist.github.com/RaminHAL9001/22bcb9c32786f089fb973444adfa0619</a>.</H5>
</div>
</body></html>