Last active
November 18, 2017 15:54
<html><head> | |
<style type="text/css"> | |
div.top-level { | |
width: 20cm; | |
margin-left: 0.5cm;
margin-right: 0.5cm;
} | |
p, ol, td, blockquote { | |
font-family: serif; | |
color: #000000; | |
line-height: 1.5em; | |
} | |
h1 { | |
font-family: serif; | |
color: #000000; | |
} | |
h2 { | |
font-family: serif; | |
color: #000000; | |
} | |
h3 { | |
font-family: serif; | |
color: #000000; | |
} | |
code { | |
background-color: #F0F0F0; | |
color: #400000; | |
} | |
pre { | |
background-color: #F0F0F0; | |
color: #000000; | |
line-height: 1.4em; | |
} | |
.vocabulary-word { | |
font-weight: bold; | |
font-style: oblique; | |
color: #008000; | |
vertical-align: top; | |
} | |
span.prompt { | |
color: #808080; | |
} | |
span.file-name, a.file-name { | |
font-family: monospace; | |
text-decoration: underline; | |
color: #204020 | |
} | |
.user-input { | |
color: #400000; | |
} | |
a.section-link { | |
color: #000080; | |
text-decoration: none; | |
} | |
a:hover.section-link { | |
color: #0000FF; | |
text-decoration: underline; | |
} | |
code.single-char { | |
color: #400000; | |
border: 0.0625em solid; | |
border-radius: 0.25em; | |
background-color: #F0F0FF; | |
font-size: 1.4em; | |
} | |
.output { | |
color: #000040; | |
} | |
span.keystroke { | |
color: #000000; | |
border: 0.0625em solid; | |
border-radius: 0.25em; | |
font-family: sans-serif; | |
font-style: oblique; | |
} | |
code.token { | |
color: #0000F8; | |
border: 0.0625em solid; | |
border-radius: 0.25em; | |
background-color: #F0F0FF; | |
} | |
td.source-code { | |
background-color: #FFFFFF; | |
border: 1px solid black; | |
padding: 2px; | |
} | |
pre.source-code { | |
background-color: #FFFFFF; | |
} | |
table.source-code { | |
background-color: #F0F0F0; | |
border: 1px solid black; | |
margin: 20px; | |
} | |
</style> | |
<title>Bash Basics: How The Command Shell "Sees" the Words it Reads</title> | |
</head><body> | |
<div class="top-level"> | |
<h1>Bash Basics: How The Command Shell "Sees" the Words it Reads</h1> | |
<p>As someone who can read, you may take for granted that every word is
separated by a space. But for an unintelligent computer program like Bash, the
process of breaking input into individual words must be defined in computer
code as a grammatical algorithm.</p>
<p>It can be very helpful to understand these rules. When you
enter a command and Bash does not do what you expected, could it be because it
is simply reading or <q>understanding</q> your command in a way that you don't
expect? Often this is the problem, and a thorough understanding of how
Bash actually reads and understands commands can make your life much easier as
you become more skilled in using Ubuntu, Linux, or macOS.</p>
<p>Fortunately, you don't need to be an expert to understand Bash's tokenizing
grammar. Most of the ordinary Bash token grammar is quite simple for anyone to
understand. There are a few complicated rules that experts need to worry about,
but we will worry about those another day. Let's keep things simple for now:</p>
<em><b>When you type anything into Bash, the first thing it does is break down what | |
you typed into a list of words called <q>tokens</q>. | |
</b></em> | |
<p>Once Bash has its list of words (tokens), it then <q>thinks</q> about each
word one by one. In this lesson, we learn the six most basic rules Bash uses to
read the command you typed, and how it breaks your command down into tokens
that it can understand and think about individually. There are actually a few
more than six rules, but this chapter goes over the most basic ones.</p>
<ol> | |
<li> <a class="section-link" href="#hash_tags">Ignore hash tags.</a> | |
<li> <a class="section-link" href="#space_separated">Tokens are separated by spaces (usually).</a> | |
<li> <a class="section-link" href="#quotes">Tokens can have spaces in them if they are quoted.</a> | |
<li> <a class="section-link" href="#join_adjacent">Tokens that are not separated by spaces are joined together.</a> | |
<li> <a class="section-link" href="#special_punctuation">Some punctuation marks are special and are not joined together.</a> | |
<li> <a class="section-link" href="#backslash">The backslash turns a special punctuation mark into an ordinary token.</a>
</ol> | |
<table class="glossary"><thead class="glossary"><tr><th colspan="2"><h3>Terminology</h3></th></tr></thead>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">Token</span>:</td><td>A single, atomic unit of computer code that is constructed
from a sequence of <q>characters</q> in accordance with the grammatical rules
of the computer language.</td></tr>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">Character</span>:</td><td>A single letter, number, punctuation mark, or whitespace
value which contains the smallest amount of human-readable information.</td></tr>
<tr>
<td class="vocabulary-word"><span class="vocabulary-word">String</span>:</td><td>A unit of data containing a sequence of characters. A string
is different from a token in that tokens are elements taken out of a computer
program according to token grammar rules, whereas a string can contain any data
without regard for token grammar. In the Bash programming language there is
almost no practical difference between tokens and strings, but many other
programming languages do not allow you to treat tokens as strings.</td>
</tr>
</table> | |
<h2>Before we begin...</h2> | |
<p>Bash is everywhere, so it is incredibly easy to open a Terminal window and
just start experimenting. It takes no effort, and you can do it any time you
want. So as you read, there is no need to take this instruction as mere computer
science theory. You can actually put theory into practice!</p>
<p>So before we begin learning the rules, here is a two-line Bash program you
can try right now to experiment with the various examples below. Enter this
code into the TextEdit program and save the file as <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> in your
Home folder:</p>
<table class="source-code"> | |
<thead> | |
<tr><td> | |
🗎 <a name="tokenizer.sh"><span class="file-name">~/tokenizer.sh</span></a>
</td></tr></thead> | |
<tr><td class="source-code"> | |
<pre class="source-code"> | |
#!/bin/bash | |
( for x in "${@}" ; do echo "$x"; done; ) | cat -n | |
</pre> | |
</td></tr> | |
</table> | |
<p>Now open Terminal and check if you did it right:</p> | |
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">cat tokenizer.sh</span> | |
<span class="output">#!/bin/bash</span> | |
<span class="output">( for x in "${@}" ; do echo "$x"; done; ) | cat -n</span> | |
</pre> | |
<p>If instead you get an error:</p> | |
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">cat tokenizer.sh</span> | |
<span class="output">cat: tokenizer.sh: No such file or directory</span> | |
</pre> | |
<p>then please check to make sure you saved the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> file in the Home
folder, or else use the <code>cd</code> command to change to the directory in
which you did save the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> file.
</p>
<p>Let's try running <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> in the Terminal. Enter the text
<code class="user-input">bash tokenizer.sh This is an example.</code> as the
command text. The rest will be generated by Bash as soon as you press the Enter
key. The whole interaction will look like this in your Terminal window:</p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">bash tokenizer.sh This is an example.</span> | |
<span class="output">     1  This</span>
<span class="output">     2  is</span>
<span class="output">     3  an</span>
<span class="output">     4  example.</span>
<span class="prompt">YourName@ComputerName:~$ </span> | |
</pre> | |
<p>Did it work? Great! So from now on, if you see an example in the text below, | |
which looks like this:</p> | |
<div><code>This is an example.</code></div> | |
<ol> | |
<li><code class="token">This</code></li> | |
<li><code class="token">is</code></li> | |
<li><code class="token">an</code></li> | |
<li><code class="token">example.</code></li> | |
</ol> | |
<p>don't be afraid to try the example out using <q><span class="file-name">tokenizer.sh</span></q>.</p> | |
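<p>If you would rather not save a file, the same idea can be sketched as a shell
function that you paste straight into Terminal. The name
<code>show_tokens</code> is made up for this illustration and is not part of
Bash. This sketch uses <code>printf</code> rather than <code>echo</code> so that
a token such as <code>-n</code> is printed as-is instead of being taken as an
option.</p>

```shell
#!/bin/bash
# show_tokens is a hypothetical helper name chosen for this lesson.
# It prints every token (argument) it receives on its own numbered line.
show_tokens() {
    local n=0 token
    for token in "$@"; do
        n=$((n + 1))
        printf '%d. %s\n' "$n" "$token"
    done
}

# Bash splits the rest of this line into four tokens before
# show_tokens ever sees them.
show_tokens This is an example.
```

<p>Once the function is defined, you can type <code>show_tokens</code> in place
of <code>bash tokenizer.sh</code> in any of the examples that follow.</p>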
<h3>If you get stuck...</h3> | |
<p>Never forget that <span class="keystroke">Ctrl C</span> will <b>C</b>ancel
the command you were typing and let you start again from nothing.</p>
<p>Occasionally you may mis-type something and Bash will appear to freeze. For example,
if you type an apostrophe <code class="single-char">'</code> all alone: </p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">bash tokenizer.sh Why won't this work?</span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span><span class="user-input"> echo try again</span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span><span class="user-input"> askdjaczxca</span> | |
<span class="prompt">> </span><span class="user-input"> jasiuhq fidjfn ZXzxucaccas asdasdf </span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span> | |
<span class="prompt">> </span><span class="user-input"> aaaaaaaaaaaaaaaaaaaaaaaaaaaa </span> | |
<span class="prompt">> </span> | |
</pre> | |
<p>I kept pressing <span class="keystroke">Enter</span> but the command prompt
never came back. All I got was the <code class="single-char">></code>
symbol, and it wouldn't do anything! What is happening here is that the
apostrophe is actually an opening single-quote character (discussed in <a
href="#quotes" class="section-link">rule #3</a>), and Bash waits for you to write the closing
single-quote character; it waits even after you press the Enter key.
An unclosed double-quote <code class="single-char">"</code> will cause the same
problem, as will an unclosed parenthesis <code class="single-char">(</code> or curly bracket
<code class="single-char">{</code>, as we will see with <a class="section-link"
href="#special_punctuation">rule #5</a>.</p>
<p>If you ever make this mistake, <span | |
class="keystroke">Ctrl C</span> is your friend. | |
</p> | |
<h3>So let's get started learning about the Bash tokenizer rules!</h3>
<hr /> | |
<h2><a name="hash_tags">Rule 1: Ignore hash tags</a></h2> | |
<p>This is the simplest rule: when a token begins with the <code class="single-char">#</code> character,
Bash ignores the hash and everything after it on that line. You can use this to write comments to yourself:</p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">#This line starts with a hash tag. It does absolutely nothing.</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">#But you can use a hash tag in the middle of a command as well:</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo Some people #just don't</span> | |
<span class="output">Some people</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo know how #useful Bash can be</span> | |
<span class="output">know how</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo to use a computer #with a command line interface.</span> | |
<span class="output">to use a computer</span> | |
<span class="prompt">YourName@ComputerName:~$ </span> | |
</pre> | |
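<p>One detail worth trying for yourself: the hash only starts a comment when it
begins a token. A hash in the middle of a token is an ordinary character. The
short sketch below counts tokens with a helper named <code>count_tokens</code>,
a name invented for this illustration.</p>

```shell
#!/bin/bash
# count_tokens is a hypothetical helper: it remembers and prints
# how many tokens (arguments) Bash handed to it.
count_tokens() {
    token_count=$#
    echo "$#"
}

# The hash begins a token here, so it and the rest of the line are ignored.
count_tokens Some people #just ignore this
with_comment=$token_count

# The hash sits in the middle of the token "issue#42", so nothing is ignored.
count_tokens issue#42 is open
without_comment=$token_count
```

<p>The first call reports 2 tokens and the second reports 3.</p>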
<hr /> | |
<h2><a name="space_separated">Rule 2: Words are separated by spaces (usually).</a></h2> | |
<p>When you type <q><code>This is some text.</code></q> into Bash, what is the
list of tokens it sees? Well, the first thing Bash will do is just tear the sentence
up along the spaces between the tokens:</p>
<div class="user-input"><code>This is some text.</code></div> | |
<ol> | |
<li><code class="token">This</code></li> | |
<li><code class="token">is</code></li> | |
<li><code class="token">some</code></li> | |
<li><code class="token">text.</code></li> | |
</ol> | |
<p>But do you see how <q><code class="token">text.</code></q> has a dot after it? This is | |
because there is no space between the token "text" and the dot. That means the | |
dot is part of the token. So what would happen if you wrote a space between the | |
token <q><code>text</code></q> and the dot: | |
<q><code>This is some text .</code></q>? | |
Well then, Bash would see this list of tokens:</p> | |
<div class="user-input"><code>This is some text .</code></div> | |
<ol> | |
<li><code class="token">This</code></li> | |
<li><code class="token">is</code></li> | |
<li><code class="token">some</code></li> | |
<li><code class="token">text</code></li> | |
<li><code class="token">.</code></li> | |
</ol> | |
<p>With a space between the token <q><code>text</code></q> and the dot, the dot
becomes its own token.</p>
<blockquote class="remember">
<b>Good to remember:</b> while Bash does not think the lone dot <code
class="single-char">.</code> is special, the dot <em>could</em> be special to
commands Bash is running, like the <code class="user-input">ls</code> command. For
some commands, the dot token means <q><b>right here</b>,</q> as in, <q>save a
file <b>right here</b>.</q> Other times, dot just means a dot, like when it is
part of a file's name, e.g. <code class="output">photo.jpg</code>. But in the
Bash language, dots (and also commas) have no special grammatical meaning. They
are just part of ordinary tokens, and get mixed together with other tokens
according to the usual tokenizer rules.
</blockquote> | |
<p>Words are made of letters, numbers and the non-special punctuation marks discussed below.</p> | |
<div><code>This sentence has 7 tokens in it #and this is ignored.</code></div> | |
<ol> | |
<li><code class="token">This</code></li> | |
<li><code class="token">sentence</code></li> | |
<li><code class="token">has</code></li> | |
<li><code class="token">7</code></li> | |
<li><code class="token">tokens</code></li> | |
<li><code class="token">in</code></li>
<li><code class="token">it</code></li> | |
</ol> | |
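<p>It does not matter how many spaces sit between two tokens: any run of
spaces counts as a single separator. Here is a small sketch you can paste into
Terminal to check this, using a made-up helper called
<code>count_tokens</code>.</p>

```shell
#!/bin/bash
# count_tokens is a hypothetical helper: it prints how many
# tokens (arguments) Bash handed to it.
count_tokens() {
    echo "$#"
}

count_tokens This is some text.              # prints 4
count_tokens This     is   some      text.   # still prints 4
count_tokens This is some text .             # prints 5, the dot is its own token
```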
<hr /> | |
<h2><a name="quotes">Rule 3: Quoted tokens can have spaces in them</a></h2> | |
<p>It is often useful to tell Bash to use a whole bunch of tokens as just one
token. This comes in handy when telling Bash to use a file, where the file name
has spaces in it. To do this, we use a single-quote character, also known as
the apostrophe, for example:</p>
<div class="user-input"><code>His exact words were, 'Yes, I think so.'</code></div>
<ol> | |
<li><code class="token">His</code></li> | |
<li><code class="token">exact</code></li> | |
<li><code class="token">words</code></li> | |
<li><code class="token">were,</code></li> | |
<li><code class="token">Yes, I think so.</code></li> | |
</ol> | |
<p>The fifth token above is everything between the single-quotes. Notice that
the single-quotes do not exist in the token itself. Dots and commas have no
special grammatical meaning to Bash, but single-quotes do. A single-quote says
to Bash, <b>take all the characters you see until the next single-quote and treat
them as one big token,</b> and remove the single-quotes.</p>
<p>Which character is more powerful, the hash <code
class="single-char">#</code> or the single quote <code
class="single-char">'</code>? The answer is: <u>whichever one comes first</u>
is the one Bash uses. Consider the command
<code>echo ### Welcome! ###</code>. By <a class="section-link" href="#hash_tags">rule #1</a>, the first hash <code
class="single-char">#</code> character comments out everything after it, so
<code>echo</code> prints only an empty line. Let's try
this again with the single-quote character:</p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo '### Welcome! ###'</span> | |
<span class="output">### Welcome! ###</span> | |
<span class="prompt">YourName@ComputerName:~$ </span> | |
</pre> | |
<p>Putting the <code class="token">### Welcome! ###</code> inside of the | |
single-quotes made the hash characters into part of the token. But if we flipped | |
it around and wrote the hash character before the single-quote, the hash would | |
win:</p> | |
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo ### 'Welcome!' ###</span> | |
<span class="output"></span> | |
<span class="prompt">YourName@ComputerName:~$ </span> | |
</pre> | |
<p>It is also possible to use a double-quote <code class="single-char">"</code> | |
character to construct tokens with spaces, <b>but be careful!</b> Double-quote | |
characters have an entirely different set of rules they follow when | |
constructing tokens. For simple tokens, they behave like single-quote <code | |
class="token">'</code> characters. Let's retry the <q>Yes, I think so.</q> | |
example above but with double-quotes <code class="token">"</code> instead | |
of single quotes: | |
</p> | |
<div class="user-input"><code>His exact words were, "Yes, I think so."</code></div>
<ol> | |
<li><code class="token">His</code></li> | |
<li><code class="token">exact</code></li> | |
<li><code class="token">words</code></li> | |
<li><code class="token">were,</code></li> | |
<li><code class="token">Yes, I think so.</code></li> | |
</ol> | |
<p><b>However</b>, things start to go wrong if you aren't careful when you make
tokens using double-quotes <code class="single-char">"</code>:</p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "It costs less than $5 in the USA."</span> | |
<span class="output">It costs less than  in the USA.</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "Hello, world!"</span> | |
<span class="output">Hello, world!</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "Hello, world!!"</span> | |
<span class="output">echo "Hello, worldecho "Hello, world!""</span> | |
<span class="output">Hello, worldecho Hello, world!</span> | |
</pre> | |
<p>Double-quoted <code class="single-char">"</code> tokens expand
variables into character strings, a feature known as
<q><a href="https://en.wikipedia.org/wiki/String_interpolation">String
Interpolation</a></q>. In the first example above, Bash replaced
<code>$5</code> with the value of the variable <code>5</code>, which was empty.
In the last example, Bash's command history feature replaced <code>!!</code>
with the previous command. We will talk more about string interpolation in the
lesson about Bash variables, but here is a quick preview of what double-quoted
tokens can do when used correctly:
</p>
<pre> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">ITEM='Apple Cinnamon Cappuccino'</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">COST=3.95</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo "You can buy a delicious ${ITEM} for only \$${COST}"\!</span> | |
<span class="output">You can buy a delicious Apple Cinnamon Cappuccino for only $3.95!</span> | |
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input"># Let's try the exact same thing with single quotes...</span>
<span class="prompt">YourName@ComputerName:~$ </span><span class="user-input">echo 'You can buy a delicious ${ITEM} for only \$${COST}'\!</span> | |
<span class="output">You can buy a delicious ${ITEM} for only \$${COST}!</span> | |
</pre> | |
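<p>The file name case mentioned at the start of this rule can be sketched as
follows. The example works inside a scratch folder made with
<code>mktemp -d</code>, so nothing in your Home folder is touched. The file
names are made up for illustration.</p>

```shell
#!/bin/bash
# Work in a throwaway folder so no real files are affected.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Quoted: one token, so touch creates ONE file named "my notes.txt".
touch 'my notes.txt'

# Unquoted: two tokens, so touch creates TWO files, "my" and "notes.txt".
touch my notes.txt

# List what was created, one name per line, and count the names.
ls -1
file_count=$(ls -1 | wc -l | tr -d ' ')
echo "Number of files: $file_count"

# Clean up the scratch folder.
cd /
rm -r -- "$workdir"
```

<p>Three files are listed: one from the quoted command and two from the
unquoted one.</p>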
<hr /> | |
<h2><a name="join_adjacent">Rule 4: Tokens not separated by spaces are joined together into a single token.</a></h2> | |
<p>How is this different from <a class="section-link" href="#space_separated">rule #2</a>? If you have two tokens, like a number
and a word, right next to each other, for example <q><code>123hello</code></q>,
it is probably obvious to you that Bash will treat those as a single
token. But what about input like this:</p>
<div><code>'Working hard?''Hardly working!'</code></div> | |
<p>Will this be two tokens or just one token? The answer is: <u>one token</u>, | |
because there is no space between the first and second single-quoted tokens. | |
The two individual tokens:</p> | |
<p> | |
<code class="token">Working hard?</code> <code | |
class="token">Hardly working!</code> | |
</p> | |
<p>are joined into a single token.</p> | |
<ol> | |
<li><code class="token">Working hard?Hardly working!</code> | |
</ol> | |
<p>Each token is a single-quoted token, but there is no space between the two | |
single-quoted tokens, so Bash joins these two tokens into one big token. This | |
is a very useful feature which will come up again in the lesson about | |
variables. | |
</p> | |
<p>But what happens if we type something like this:</p>
<div><code>He won't do it because he doesn't even know how.</code></div> | |
<ol> | |
<li><code class="token">He</code> | |
<li><code class="token">wont do it because he doesnt</code> | |
<li><code class="token">even</code> | |
<li><code class="token">know</code> | |
<li><code class="token">how.</code> | |
</ol> | |
<p>Only 5 tokens. Why? Because Bash thinks the apostrophes in the tokens | |
<q><i>won't</i></q> and <q><i>doesn't</i></q> are actually single-quotes, and | |
all of the letters and spaces between those single-quotes, <q><code>'t do it | |
because he doesn'</code></q> becomes one long token <q><code class="token">t do | |
it because he doesn</code></q>. So the input <q><code>won't do it because he | |
doesn't</code></q> is actually a three-part token:</p> <div><code | |
class="token">won</code> <code class="token">t do it because he doesn</code> | |
<code class="token">t</code></div> <p>And since there is no space between these | |
tokens, they are all joined into one big token, as you can see above.</p> | |
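<p>Here is a short sketch of the joining rule in action. The helper
<code>count_tokens</code> and the folder name are invented for this
illustration. The last part shows why joining is handy: a double-quoted piece
and a single-quoted piece written with no space between them become one
token.</p>

```shell
#!/bin/bash
# count_tokens is a hypothetical helper: it prints how many
# tokens (arguments) Bash handed to it.
count_tokens() {
    echo "$#"
}

count_tokens 'Working hard?''Hardly working!'    # prints 1, no space between the quotes
count_tokens 'Working hard?' 'Hardly working!'   # prints 2, a space separates them

# Two differently quoted pieces joined into a single token.
folder='My Documents'
joined="$folder"'/notes.txt'
echo "$joined"                                   # prints My Documents/notes.txt
```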
<hr /> | |
<h2><a name="special_punctuation">Rule 5: Most punctuation marks have special grammatical meaning</a></h2> | |
<p>We have seen how the hash <code class="single-char">#</code> and
single-quote <code class="single-char">'</code> characters have special
grammatical meaning to Bash. It is important to note that most punctuation
marks have special grammatical meaning.</p>
<p><b>That means, never use the following characters without quoting them</b>
unless you know what special thing they do. (They are listed here for reference;
don't worry about what each one means for now.)</p>
<table> | |
<tbody> | |
<tr><td valign="top"><code class="token">#</code></td><td valign="top">Hash</td><td valign="top">— Comment</td></tr> | |
<tr><td valign="top"><code class="token">'</code></td><td valign="top">Single Quote</td><td valign="top">— String delimiter</td></tr> | |
<tr><td valign="top"><code class="token">"</code></td><td valign="top">Double Quote</td><td valign="top">— Interpolating string delimiter</td></tr> | |
<tr><td valign="top"><code class="token">`</code></td><td valign="top">Back Quote</td><td valign="top">— Sub-process expansion</td></tr> | |
<tr><td valign="top"><code class="token">\</code></td><td valign="top">Backslash</td><td valign="top">— Escape special character</td></tr> | |
<tr><td valign="top"><code class="token">$</code></td><td valign="top">Dollar Sign</td><td valign="top">— Variable dereferencing</td></tr> | |
<tr><td valign="top"><code class="token">*</code></td><td valign="top">Asterisk</td><td valign="top">— Glob (a.k.a. Wildcard) pattern</td></tr> | |
<tr><td valign="top"><code class="token">&amp;</code></td><td valign="top">Ampersand</td><td valign="top">— Launch background command</td></tr>
<tr><td valign="top"><code class="token">;</code></td><td valign="top">Semicolon</td><td valign="top">— Command delimiter</td></tr>
<tr><td valign="top"><code class="token">&lt;</code></td><td valign="top">Less Than</td><td valign="top">— Pull stream input from file</td></tr>
<tr><td valign="top"><code class="token">&gt;</code></td><td valign="top">Greater Than</td><td valign="top">— Push stream output to file</td></tr>
<tr><td valign="top"><code class="token">=</code></td><td valign="top">Equal Sign</td><td valign="top">— Assign variable</td></tr> | |
<tr><td valign="top"><code class="token">|</code></td><td valign="top">Pipe</td><td valign="top">— Command pipeline constructor</td></tr> | |
<tr><td valign="top"><code class="token">(</code></td><td valign="top">Open Round Bracket</td><td valign="top">— Sub-process command delimiter</td></tr> | |
<tr><td valign="top"><code class="token">)</code></td><td valign="top">Close Round Bracket</td><td valign="top">— Sub-process command delimiter</td></tr> | |
<tr><td valign="top"><code class="token">{</code></td><td valign="top">Open Curly Brackets</td><td valign="top">— Token choice pattern, or subroutine delimiter</td></tr> | |
<tr><td valign="top"><code class="token">}</code></td><td valign="top">Close Curly Brackets</td><td valign="top">— Token choice pattern, or subroutine delimiter</td></tr> | |
</tbody> | |
</table> | |
<p>Some characters are <b>sometimes</b> special and sometimes not. Avoid using | |
the following characters (again, unless you know what special thing they | |
do):</p> | |
<table> | |
<tbody> | |
<tr><td valign="top"><code class="single-char">%</code></td><td valign="top">Percent</td><td valign="top">— Background process selector (only special when used alone or with a number)</td></tr> | |
<tr><td valign="top"><code class="single-char">~</code></td><td valign="top">Tilde</td><td valign="top">— Abbreviation for home directory (only special at the start of a non-quoted token)</td></tr> | |
<tr><td valign="top"><code class="single-char">!</code></td><td valign="top">Exclamation Point</td><td valign="top">— Command history selection (only special when used alone, or with a number)</td></tr> | |
<tr><td valign="top"><code class="single-char">[</code></td><td valign="top">Open Square Bracket</td><td valign="top">— Character set pattern (only special when files matching the pattern exist)</td></tr> | |
<tr><td valign="top"><code class="single-char">]</code></td><td valign="top">Close Square Bracket</td><td valign="top">— Character set pattern (only special when files matching the pattern exist)</td></tr> | |
</tbody> | |
</table> | |
<p>All other punctuation marks are used as parts of tokens, or become their own | |
token if they are separated by spaces. <b>So it is OK to use the following characters:</b></p> | |
<table> | |
<tbody> | |
<tr><td valign="top"><code class="single-char">_</code></td><td>Underscore</td></tr> | |
<tr><td valign="top"><code class="single-char">+</code></td><td>Plus Sign</td></tr> | |
<tr><td valign="top"><code class="single-char">-</code></td><td>Minus Sign</td></tr> | |
<tr><td valign="top"><code class="single-char">@</code></td><td>At Sign</td></tr> | |
<tr><td valign="top"><code class="single-char">/</code></td><td>Slash</td></tr> | |
<tr><td valign="top"><code class="single-char">:</code></td><td>Colon</td></tr> | |
<tr><td valign="top"><code class="single-char">.</code></td><td>Dot</td></tr> | |
<tr><td valign="top"><code class="single-char">,</code></td><td>Comma</td></tr> | |
<tr><td valign="top"><code class="single-char">^</code></td><td>Caret</td></tr>
</tbody> | |
</table> | |
<p>So let's see the kind of tokens we can make with the non-special characters:</p>
<div><code class="user-input">one/two/three four-five-six seven.eight.nine@ 10:11 p.m. me+my_date</code></div> | |
<ol> | |
<li><code class="token">one/two/three</code></li> | |
<li><code class="token">four-five-six</code></li> | |
<li><code class="token">seven.eight.nine@</code></li> | |
<li><code class="token">10:11</code></li> | |
<li><code class="token">p.m.</code></li> | |
<li><code class="token">me+my_date</code></li> | |
</ol> | |
<p>The non-special characters are just a part of the token in which they | |
appear, as if they were no different from a letter or number. But the spaces | |
between the words still separate tokens according to <a class="section-link" href="#space_separated">rule #2</a>.</p> | |
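<p>To see that quoting really does switch off the special characters, here is
a small sketch using single-quotes from <a class="section-link"
href="#quotes">rule #3</a>. The helper name <code>show</code> is made up for
this illustration. It prints each token it receives inside square brackets.</p>

```shell
#!/bin/bash
# show is a hypothetical helper: it prints each token it receives
# on its own line, wrapped in square brackets.
show() {
    printf '[%s]\n' "$@"
}

# Every one of these tokens contains a special character, but the
# single-quotes make Bash treat each of them as ordinary text.
show 'R&D' 'a;b' '5 > 3' '$HOME' '*'
```

<p>Each of the five tokens is printed exactly as typed, minus the
single-quotes.</p>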
<h3>Rule #5.1: special tokens do not join with other tokens</h3> | |
<p>So <a class="section-link" href="#join_adjacent">rule #4</a> does not apply to special tokens.
Let's take a quick look at how special tokens are tokenized. </p>
<div><code>if(true);then(echo yes);fi|cat -n;</code></div>
<p>In this example, there are several special characters used:
<code class="single-char">;</code>,
<code class="single-char">|</code>,
<code class="single-char">(</code>, and
<code class="single-char">)</code> (the hyphen
<code class="single-char">-</code> is not special). Curly brackets are left out
of this example because Bash only treats <code class="single-char">{</code> and
<code class="single-char">}</code> as separate tokens when they stand alone as
whole words. So how do you think this will tokenize?
</p>
<blockquote class="notice">
<b>BE AWARE</b> that this example will <b>NOT</b> work with the <q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> program. If
you do try it, it will report an error:
<pre>
<span class="output">bash: syntax error near unexpected token `('</span>
</pre>
</blockquote>
<p>The answer is that it will tokenize like this (but again don't try this with
<q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q>):</p>
<ol>
<li><code class="token">if</code></li>
<li><code class="token">(</code></li>
<li><code class="token">true</code></li>
<li><code class="token">)</code></li>
<li><code class="token">;</code></li>
<li><code class="token">then</code></li>
<li><code class="token">(</code></li>
<li><code class="token">echo</code></li>
<li><code class="token">yes</code></li>
<li><code class="token">)</code></li>
<li><code class="token">;</code></li>
<li><code class="token">fi</code></li>
<li><code class="token">|</code></li>
<li><code class="token">cat</code></li>
<li><code class="token">-n</code></li>
<li><code class="token">;</code></li>
</ol>
<p>However, these tokens are swept up by Bash and <b>immediately</b> crunched
into something else (in this case, a "conditional statement"), and this happens
even before the tokens are handed off to other programs like our
<q><a href="#tokenizer.sh" class="file-name">tokenizer.sh</a></q> program. There is a more advanced Bash grammar that
applies after the tokenization step, which allows you to control if and when
certain commands are run. We will learn more about it in another lesson.</p>
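<p>A smaller experiment that you <em>can</em> run directly is sketched below.
The semicolon and the pipe are written with no spaces around them, yet Bash
still sees them as separate tokens. This sketch uses <code>tr</code> to
upper-case the piped text so the effect of the pipe is easy to spot in the
output.</p>

```shell
#!/bin/bash
# No spaces around ; or | but Bash still splits this into
# two commands, the second of which is a pipeline:
#   echo one
#   echo two | tr a-z A-Z
result=$(echo one;echo two|tr a-z A-Z)
echo "$result"
```

<p>The output is <code>one</code> on the first line and <code>TWO</code> on the
second, which shows that only the second <code>echo</code> was piped through
<code>tr</code>.</p>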
<hr /> | |
<h2><a name="backslash">Rule 6: Backslash makes a special punctuation mark ordinary</a></h2> | |
<p>The last rule to remember is that all of the above mentioned special
characters become ordinary tokens, or parts of tokens, if they follow a
backslash <code class="single-char">\</code>. For example, if you want a token
to contain an apostrophe without Bash thinking it is a single-quote, you could
write this:</p>
<div><code>I won\'t make that mistake again.</code></div> | |
<ol> | |
<li><code class="token">I</code></li> | |
<li><code class="token">won't</code></li> | |
<li><code class="token">make</code></li> | |
<li><code class="token">that</code></li> | |
<li><code class="token">mistake</code></li> | |
<li><code class="token">again.</code></li> | |
</ol> | |
<div><code>He won\'t do it because he doesn\'t even know how.</code></div> | |
<ol> | |
<li><code class="token">He</code></li> | |
<li><code class="token">won't</code></li> | |
<li><code class="token">do</code></li> | |
<li><code class="token">it</code></li> | |
<li><code class="token">because</code></li> | |
<li><code class="token">he</code></li> | |
<li><code class="token">doesn't</code></li> | |
<li><code class="token">even</code></li> | |
<li><code class="token">know</code></li> | |
<li><code class="token">how.</code></li> | |
</ol> | |
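<p>If you would like to check a token list like the ones above yourself, here is
a minimal sketch. The helper name <code>show_tokens</code> is made up for this
example (it is only a stand-in, not the
<code>tokenizer.sh</code> program from this lesson); it simply prints every
token it receives on its own numbered line.</p>

```shell
# show_tokens is a hypothetical helper: it prints each token (argument)
# it receives on a separate, numbered line.
show_tokens() {
    n=1
    for token in "$@"; do
        printf '%d: [%s]\n' "$n" "$token"
        n=$((n + 1))
    done
}

# The backslash turns the apostrophe into an ordinary character.
show_tokens I won\'t make that mistake again.
# 1: [I]
# 2: [won't]
# 3: [make]
# 4: [that]
# 5: [mistake]
# 6: [again.]
```

<p>The square brackets are only there to make the start and end of each token
easy to see.</p>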
<p>Any character at all, even spaces and hash tags, can be made to be part of a | |
token with the backslash:</p> | |
<div><code>This\ is\ one\ long\ token. \# These are separate tokens.</code></div> | |
<ol> | |
<li><code class="token">This is one long token.</code></li> | |
<li><code class="token">#</code></li> | |
<li><code class="token">These</code></li> | |
<li><code class="token">are</code></li> | |
<li><code class="token">separate</code></li> | |
<li><code class="token">tokens.</code></li> | |
</ol> | |
<p>Notice above that there is no backslash right after <code
class="token">token.</code>, so the space that follows it is not "escaped" and
the token breaks there. The words before it, whose spaces are all escaped, are
joined together into one large token.
</p>
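<p>A quick way to confirm how many tokens Bash produced is to count them. The
sketch below assumes a made-up helper called <code>describe_tokens</code>,
which reports the number of tokens it was given and shows the first one.</p>

```shell
# describe_tokens is a hypothetical helper: it prints how many tokens
# it received, followed by the first token in square brackets.
describe_tokens() {
    printf 'count: %d\n' "$#"
    printf 'first: [%s]\n' "$1"
}

# Escaped spaces stay inside the first token; the escaped hash is a token too.
describe_tokens This\ is\ one\ long\ token. \# These are separate tokens.
# count: 6
# first: [This is one long token.]
```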
<p>But a backslash is only good for one character:</p>
<div><code>With single-quotes '# # #' but with a backslash \# # # all the rest is ignored.</code></div>
<ol>
<li><code class="token">With</code></li>
<li><code class="token">single-quotes</code></li>
<li><code class="token"># # #</code></li>
<li><code class="token">but</code></li>
<li><code class="token">with</code></li>
<li><code class="token">a</code></li>
<li><code class="token">backslash</code></li>
<li><code class="token">#</code></li>
</ol>
<p>The backslash only worked its magic on the first hash <code
class="single-char">#</code> character. The second hash was not escaped, so it
started a comment and everything after it was ignored. (A hash only starts a
comment when it begins a token, so <code>\###</code> written without spaces
would simply be the single token <code class="token">###</code>.)</p>
<p>How would we write an apostrophe in the middle of a single-quoted token? Like this:
<div><code>She said, 'Well isn'\''t that something!'</code></div>
<ol>
<li><code class="token">She</code></li>
<li><code class="token">said,</code></li>
<li><code class="token">Well isn't that something!</code></li>
</ol>
</p> | |
<p>Why? Because the string | |
<code class="user-input">'Well isn'\''t that something!'</code> | |
contains three tokens | |
<div> | |
<code class="token">Well isn</code> <code class="token">'</code> <code class="token">t that something!</code> | |
</div> | |
which are not separated by white spaces, so the three tokens are joined into | |
one according to <a class="section-link" href="#join_adjacent">rule #4</a>. | |
</p> | |
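<p>The three pieces can be seen joining up in the sketch below. The helper
<code>print_tokens</code> is a made-up name for this example; it prints each
token it receives inside square brackets.</p>

```shell
# print_tokens is a hypothetical helper: printf repeats its format once
# for every token (argument), so each token lands on its own line.
print_tokens() {
    printf '[%s]\n' "$@"
}

# 'Well isn' + \' + 't that something!' touch each other, so they join.
print_tokens She said, 'Well isn'\''t that something!'
# [She]
# [said,]
# [Well isn't that something!]
```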
<p>And backslashes make <em>themselves</em> ordinary as well. That is to
say, if one backslash is followed by a second backslash, the second backslash
is treated as an ordinary character. In a sequence of backslashes, every pair
<code class="single-char">\</code><code class="single-char">\</code>
becomes a single backslash character
<code class="single-char">\</code>.
<div><code>1 \\ 2 \\\\ 3 \\\\\\ 4 \\\\\\\\ 5 \\\\\\\\\\</code></div> | |
<ol> | |
<li><code class="token">1</code></li>
<li><code class="token">\</code></li>
<li><code class="token">2</code></li>
<li><code class="token">\\</code></li>
<li><code class="token">3</code></li>
<li><code class="token">\\\</code></li>
<li><code class="token">4</code></li>
<li><code class="token">\\\\</code></li>
<li><code class="token">5</code></li>
<li><code class="token">\\\\\</code></li>
</ol> | |
</p> | |
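<p>One way to check the "two become one" rule is to measure the length of the
resulting token. This is only a sketch: <code>backslash_count</code> is a
made-up helper that prints how many characters its first token contains.</p>

```shell
# backslash_count is a hypothetical helper: ${#1} is the number of
# characters in the first token the helper receives.
backslash_count() {
    printf '%s\n' "${#1}"
}

backslash_count \\            # two typed, prints 1
backslash_count \\\\          # four typed, prints 2
backslash_count \\\\\\        # six typed, prints 3
```

<p>Counting characters avoids printing the backslashes directly, which some
versions of <code>echo</code> would interpret a second time.</p>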
<hr /> | |
<h2>Conclusion</h2> | |
<p>So those are the most fundamental tokenizer rules for Bash. There are more
rules, but these are the most important to remember. Usually, if we just never
use file names with spaces or punctuation marks in them, we never have to worry
about single-quoting or backslashes, and our lives become easier. We can just
write tokens as they are and Bash will work as we expect it to.</p>
<p>This is why Linux and UNIX programmers like to name files like this:
<q><code>a-file-name-should-never-have-spaces.txt</code></q>. They use
Bash, and working with file names in Bash can get a bit tedious if the names
contain spaces or special characters.</p>
<p>So here are all the basic tokenizer rules in Bash in a handy table which you | |
may want to keep in your notebook.</p> | |
<table> | |
<tr><td>1. Ignore hash tags.</td><td><code>token token token # ignored ignored ignored</code></td></tr>
<tr><td>2. Tokens are separated by spaces.</td><td><code>this sentence has 7 tokens in it</code></td></tr>
<tr><td>3. Tokens can have spaces.</td><td><code>'this is just one token'</code> <code>this\ is\ also\ just\ one\ token</code></td></tr>
<tr><td>4. Tokens not separated by spaces are joined together into a single token.</td><td><code>firsttoken</code> <code>second' token'</code> <code>'third''token'</code> <code>fourth\ 'token'</code></td></tr>
<tr><td>5. Most punctuation marks have special meaning.</td><td>Special characters are: <code class="token">#</code> <code class="token">'</code> <code class="token">"</code> <code class="token">`</code> <code class="token">\</code> <code class="token">%</code> <code class="token">$</code> <code class="token">*</code> <code class="token">&amp;</code> <code class="token">;</code> <code class="token">!</code> <code class="token">~</code> <code class="token">&lt;</code> <code class="token">&gt;</code> <code class="token">=</code> <code class="token">|</code> <code class="token">(</code> <code class="token">)</code> <code class="token">{</code> <code class="token">}</code></td></tr>
<tr><td>6. The backslash makes special characters ordinary.</td><td>You can enter <q><code>\#</code></q> or <q><code>\'</code></q> to use those characters alone.</td></tr>
</table> | |
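<p>The examples in the table can be checked by counting tokens. The sketch below
assumes a made-up helper named <code>count_tokens</code>, which prints how many
tokens Bash handed to it; it is meant to be saved in a script file and run with
Bash.</p>

```shell
# count_tokens is a hypothetical helper: $# is the number of tokens
# (arguments) that Bash passed to it.
count_tokens() {
    echo "$#"
}

# Rule 1: prints 3, everything after the hash is ignored.
count_tokens token token token # ignored ignored ignored

# Rule 2: prints 7.
count_tokens this sentence has 7 tokens in it

# Rule 3: each of these prints 1.
count_tokens 'this is just one token'
count_tokens this\ is\ also\ just\ one\ token

# Rule 4: prints 4, because adjacent pieces are joined.
count_tokens firsttoken second' token' 'third''token' fourth\ 'token'

# Rule 6: prints 2, the hash and the apostrophe are ordinary here.
count_tokens \# \'
```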
<hr /> | |
<H5>Copyright © Ramin Honary 2017. This document is published under the <a
href="https://creativecommons.org/license/by-nc/3.0/legalcode">Creative Commons
Attribution-NonCommercial 3.0 Unported</a> license. The markup source code is
available on GitHub at <a
href="https://gist.github.com/RaminHAL9001/22bcb9c32786f089fb973444adfa0619">https://gist.github.com/RaminHAL9001/22bcb9c32786f089fb973444adfa0619</a>.</H5>
</div> | |
</body></html> |